This is a Python implementation of Generalized Linear Models, inspired by the theoretical material covered in a Multivariate Statistics course and built on object-oriented programming design principles. It's made as an attempt to understand the statistical concepts behind GLMs, practice efficient OOP design, and write Python packages.
"Matrices act. They don't just sit there."
― Gilbert Strang
beware some math and stats ahead :p
Linear models were originally conceived by Gauss with the aim of finding a relationship between a response variable $y$ and predictors $X$, under the assumptions that:

- $y_i$ are independent
- $y_i$ are normally distributed with mean $\mu_i$, so $y_i \sim N(\mu_i, \sigma^2)$
- $\mu_i = X_i^T\beta$: the means $\mu$ are related to the predictors $X$ by a linear model (linear in the parameters $\beta$; the $X$ can be transformed)

Then came the Generalized Linear Models (GLM), a generalization of the linear model, where:

- $y_i \sim$ exponential family
- $g(\mu_i) = X_i^T\beta$: the means $\mu$ are related to the predictors $X$ by a link function $g$ (not necessarily linear: logit, log, inverse...)
In regular linear models, we can use Least Squares or Maximum Likelihood Estimation to estimate the coefficients.
| Linear Model | Generalized Linear Model |
|---|---|
| solved by OLS and MLE | solved by MLE |
In linear models, estimating the parameters comes from solving one of these optimization problems:

- Least Squares: an optimization problem where we want to minimize the sum of squared residuals, $min \sum_{i=1}^{n} (y_i - \hat y_i)^2$, where $\hat y_i = X_i^T\beta$; solved by differentiating and equating to 0.
- Maximum Likelihood: assuming some distribution on $y$; in LM, $y \sim N(\mu, \sigma^2)$ where $\mu_i = X_i^T\beta$ (sometimes $\mu_i$ will be above and sometimes below the line). The probability of obtaining the observed $y$ is $\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i-\mu_i)^2}{2\sigma^2}}$; we want to maximize this probability (choose the $\beta$'s that do), so we take the log likelihood, which becomes a sum of logs:

$l(\beta) = \sum_{i=1}^{n} log(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i-\mu_i)^2}{2\sigma^2}}) = \sum_{i=1}^{n} log(\frac{1}{\sqrt{2\pi\sigma^2}}) - \sum_{i=1}^{n} \frac{(y_i-\mu_i)^2}{2\sigma^2}$

The goal is to maximize this quantity
$\iff$ maximize $- \sum_{i=1}^{n} \frac{(y_i-\mu_i)^2}{2\sigma^2}$
$\iff$ minimize $\sum_{i=1}^{n} (y_i - \mu_i)^2$
$\iff$ minimize the sum of squared residuals (same as OLS).
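To make this equivalence concrete, here is a small numerical sketch (synthetic data, plain numpy/scipy, not part of this library's API) that minimizes the Gaussian negative log-likelihood and compares the result with the closed-form OLS solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def neg_log_likelihood(beta, X, y, sigma2=1.0):
    # Gaussian negative log-likelihood up to constants: (1 / (2 sigma^2)) * sum (y_i - x_i^T beta)^2
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2) / sigma2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y)).x
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)     # closed-form least squares solution
print(np.allclose(beta_mle, beta_ols, atol=1e-3))  # True: the two estimates coincide
```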
Check here for more comparisons between Least Squares and Maximum Likelihood.
Generalized Linear Models (GLM) are a class of models including linear, logistic and Poisson regression, among others. They are generalizations of the linear model in which the response follows a distribution from the exponential family.

The aim of these models is to model the relationship between the response variable $y$ and the predictors $X$. Hence, the model is defined as $g(\mu_i) = X_i^T\beta$, with a link function $g$ chosen according to the distribution of $y$.

The parameters of the model are estimated by maximizing the likelihood of the observed data given the model. The likelihood is a function of the coefficients and is maximized by differentiating and equating to 0. This is equivalent to minimizing the loss function, a function that measures the difference between the predicted and actual values of the response variable (literally, the negative log likelihood). The model is usually evaluated by deviance, a measure of the goodness of fit used to compare the performance of different models.
In a linear model, $y = X\beta + \epsilon$. This is based on 3 main assumptions:

- $\epsilon_i$ are independent (ensures that $\hat\beta$ are unbiased estimators of the true coefficients by having $\mathbb{E}(\epsilon) = 0 \iff \mathbb{E}(y) = X\beta$); this makes the off-diagonal elements of the covariance matrix of $\hat\beta$ equal to 0
- $\epsilon_i$ are identically distributed (ensures that the variance of the residuals is constant and equal to $\sigma^2$)
- $\epsilon \sim N(0, \sigma^2)$ (ensures that the residuals are normally distributed)

So the errors are i.i.d. normal.
With this in mind, let $\hat\beta$ be the estimate of $\beta$. The least squares approach allows us to get $\hat\beta$ by minimizing the residual sum of squares. The residuals are the difference between the observed and the predicted value of the response variable, $\hat\epsilon_i = y_i - \hat y_i$.
In this figure it's worth noting the 2 types of errors we have:

- error of prediction ($\hat \epsilon$: predicting a new $y$ from a new observation of $X$ has an error based on $X\hat\beta$ and $y$)
- error of estimation (the difference between $X\beta$ and $X\hat\beta$: when we do estimation we don't know the real values of $\beta$, we estimate $\hat\beta$ and the predicted value is $X\hat\beta$)

$prediction\ error \gt estimation\ error$.
To estimate the coefficients of the model, we minimize the residual sum of squares (RSS), which is the sum of the squared residuals, $RSS = \sum_{i=1}^{n} (y_i - \hat y_i)^2$.

Enter the normal equation, which gives us the closed-form solution: $\hat \beta = (X^TX)^{-1}X^Ty$, where $X^TX$ has to be invertible (a minimal numpy sketch follows the list below). This fails in two cases:

- When $p \gt n$, we have more predictors than observations, so we can't estimate the coefficients: the rank of $X \lt p$, thus $X^TX$ is not invertible
- When the predictors are linearly dependent, the rank of $X \lt p$, thus $X^TX$ is not invertible
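As a minimal sketch of the normal equation on synthetic data (not this library's API), with `np.linalg.lstsq` as the numerically safer alternative when $X^TX$ is close to singular:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])   # intercept + 4 predictors
y = X @ np.array([0.5, 1.0, -2.0, 0.0, 3.0]) + rng.normal(size=n)

# normal equation: beta_hat = (X^T X)^{-1} X^T y   (requires X^T X to be invertible)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# equivalent but numerically safer when X is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat        # fitted values
residuals = y - y_hat       # estimated errors epsilon_hat
print(np.allclose(beta_hat, beta_lstsq))
```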
To deal with case 1, use regularization (L1, L2, elastic net), dimensionality reduction (PCA), or feature selection (backward, forward, stepwise selection).

To deal with case 2, we can remove collinear predictors, among other options. To detect multicollinearity, we can:
- compare the ratio between the largest and smallest eigenvalues of the correlation matrix R (the smallest eigenvalue is 0 when it's not full rank): if it's very large, then we have multicollinearity. There is no statistical test to check whether it's significantly greater than 0, but some empirical values such as 500 or 1000 (arbitrary thresholds) are used; when the ratio is greater than this value, we have strong collinearity.
- perform several linear regressions between predictors and check the $R^2$ values, looking for values close to 1. I.e., try to explain one predictor with the others; a model with a near-1 $R^2$ means that this predictor is very well explained by the others, it's (close to) a linear combination of them and thus a good candidate to be removed (e.g., explain x1 by x2 & x3, then x2 by x1 & x3, then x3 by x1 & x2, and check the $R^2$ values). VIF (Variance Inflation Factor) is used here, as $VIF = \frac{1}{1-R^2}$: when $R^2 \approx 1$, then $VIF \approx \infty$, and when $R^2 \approx 0$, then $VIF \approx 1$ (compute a VIF for each predictor; a short sketch follows below).
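A hedged sketch of the VIF computation described above, regressing each predictor on the others with plain numpy (the data and the helper names are illustrative, not this library's API):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of regressing y on X (X already includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j explains predictor j with the others."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        vifs.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(vifs)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)        # nearly collinear with x1
print(vif(np.column_stack([x1, x2, x3])))    # large VIFs for x1 and x3
```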
Testing the significance of a coefficient:

- Null hypothesis: the coefficient is not significant, $H_0: \beta^*_j = 0$
- Alternative hypothesis: the coefficient is significant, $H_1: \beta^*_j \neq 0$
- Test statistic: $T_n = \frac{\hat \beta_j}{\hat S(\hat \beta_j)}$, converging to a standard normal distribution $N(0,1)$ as $n \to \infty$; it's in fact a z-score, and in finite samples it follows a t-distribution with $n-p-1$ degrees of freedom
- Wald test (Student)
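A small illustration of this Wald/t test on synthetic data, where the standard errors come from $\hat\sigma^2 (X^TX)^{-1}$ (plain numpy/scipy, not this library's API):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)   # second coefficient truly 0

p = X.shape[1] - 1                                  # number of predictors (excluding intercept)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)            # unbiased estimate of sigma^2
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)  # two-sided test of beta_j = 0
print(np.round(t_stats, 2), np.round(p_values, 3))
```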
Comparing nested models:

Given $M_q$ and $M_{q'}$, where $M_q$ has $q$ predictors (full) and $M_{q'}$ has $q'$ predictors (reduced, nested), we can compare the performance of the two models by performing an anova test. Let $\sigma^{2*}$ be the variance of the residuals of the big model $M_q$, and $\sigma^2$ be the variance of the residuals of the smaller model $M_{q'}$.

- Null hypothesis: the simpler model $M_{q'}$ is better, $H_0: \sigma^2 = \sigma^{2*}$ (as if the added predictors are not significant)
- Alternative hypothesis: the full model $M_q$ is better, $H_1: \sigma^{2*} \lt \sigma^2$ (the full model has smaller residual variance)
- Test statistic: the likelihood ratio, $LR = -2(log(L(M_{q'})) - log(L(M_q)))$, which follows a chi-squared distribution with $q-q'$ degrees of freedom
- Deviance test (F-test)
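A hedged sketch of this likelihood-ratio comparison, here with two nested Poisson GLMs fitted via statsmodels (using `statsmodels` is an illustration, not this library's interface; the data is synthetic):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.8 * x1))               # x2 is irrelevant

X_full = sm.add_constant(np.column_stack([x1, x2]))   # M_q:  q = 2 predictors
X_red = sm.add_constant(x1)                           # M_q': q' = 1 predictor (nested)

ll_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit().llf
ll_red = sm.GLM(y, X_red, family=sm.families.Poisson()).fit().llf

LR = -2 * (ll_red - ll_full)          # likelihood ratio statistic
p_value = chi2.sf(LR, df=2 - 1)       # chi^2 with q - q' degrees of freedom
print(round(LR, 3), round(p_value, 3))
```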
#### Evaluation

But what is $R^2$? Some definitions:

- TSS: total variability in the response variable, $TSS = \sum_{i=1}^{n} (y_i - \bar y)^2$ (inherent variability in the response variable)
- ESS: variability explained by the model, $ESS = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$, how much of the variability in the response variable the model was able to explain (out of the TSS)
- RSS: variability not explained by the model, $RSS = \sum_{i=1}^{n} (y_i - \hat y_i)^2$
- $R^2$: proportion of variability that can be explained using X, the closer to 1 the better the model, $R^2 = \frac{ESS}{TSS}$
- Adjusted $R^2$: penalizes the $R^2$ for the number of predictors in the model, $R^2_{adj} = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}$, where $p$ is the number of predictors. When the number of predictors increases, the plain $R^2$ also increases, even if the predictors are not significant, so the adjusted version is mainly used to compare models with different numbers of predictors: when p increases, dividing by the smaller $n-p-1$ inflates the $RSS/(n-p-1)$ term unless the RSS drops enough, so the adjusted $R^2$ decreases.

Here $\bar y$ is the mean of the response variable. $RSS$ is the error that the model makes, also seen as the distance between $y$ and its projection $\hat y = X\hat\beta$ onto the space spanned by $X$. The smaller this distance is, the closer $\hat y$ is to $y$, the better the model, and the more $R^2 \approx 1$.
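A minimal sketch of these quantities on fitted values (synthetic data; the helper name `r2_report` is made up for illustration):

```python
import numpy as np

def r2_report(y, y_hat, p):
    """TSS/ESS/RSS decomposition, R^2 and adjusted R^2 for p predictors."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)        # total variability
    ess = np.sum((y_hat - y.mean()) ** 2)    # variability explained by the model
    rss = np.sum((y - y_hat) ** 2)           # variability left unexplained
    r2 = ess / tss                           # equivalently 1 - rss/tss when an intercept is fit
    r2_adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    return {"TSS": tss, "ESS": ess, "RSS": rss, "R2": r2, "R2_adj": r2_adj}

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=100)
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(r2_report(y, y_hat, p=2))
```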
Likelihood is not probability:

- probability is about getting this $y$ given a particular model, pictured as an area under the curve of a fixed distribution: $Pr(data/distribution)$
- likelihood is about fitting a model given known data, pictured as y-axis points of distributions that can be moved: $L(distribution/data)$ (more here)

![likelihood vs probability statquest note]
The maximum likelihood is an optimization problem revolving around maximizing the likelihood of the observed data given the model. "Likelihood" is a term often used in tandem with "probability", but they are not the same. By definition, the likelihood is a measure of how well a statistical model explains the observed data. Here the optimization problem is about maximizing the probability of observing the data in order to estimate the parameters, given that it follows a particular distribution, with the assumption that the data is independent and identically distributed (i.i.d.).

In other words, having $f(y_i|\mu_i)$, the probability of observing $y_i$ given $\mu_i$ with $f$ being the distribution, and $\mu_i = X_i^T\beta$, the likelihood of observing the data is $L = \prod_{i=1}^{n} f(y_i|\mu_i)$; we want to maximize $L$ to get the $\beta$ that best explains the data.

This product is usually hard to work with, so we take the log likelihood, $l = \sum_{i=1}^{n} log(f(y_i|\mu_i))$, and maximize this quantity instead. For some common distributions:

- gaussian distribution: the log likelihood is $l = \sum_{i=1}^{n} log(\frac{1}{\sqrt{2\pi\sigma^2}}) - \sum_{i=1}^{n} \frac{(y_i-\mu_i)^2}{2\sigma^2}$, which is equivalent to the OLS problem, making this a generalized approach for different types of distributions.
- binomial (Bernoulli) distribution: the log likelihood is $l = \sum_{i=1}^{n} y_i log(\mu_i) + (1-y_i)log(1-\mu_i)$, where $y_i \in \{0, 1\}$ indicates a success and $\mu_i$ is the probability of success.
- poisson distribution: the log likelihood is $l = \sum_{i=1}^{n} y_i log(\mu_i) - \mu_i$, where $\mu_i$ is the $\lambda$ parameter of the poisson distribution, so $\mu_i = e^{X_i^T\beta}$, and $y$ is the number of events in a fixed interval of time.
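A minimal sketch of these three log-likelihoods as plain Python functions (constants are dropped as in the text; the functions and names are illustrative, not this library's API):

```python
import numpy as np

def gaussian_loglik(y, mu, sigma2=1.0):
    # sum of log N(y_i | mu_i, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

def bernoulli_loglik(y, mu):
    # l = sum y_i log(mu_i) + (1 - y_i) log(1 - mu_i)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def poisson_loglik(y, mu):
    # l = sum y_i log(mu_i) - mu_i   (dropping the log(y_i!) constant)
    return np.sum(y * np.log(mu) - mu)

# mu comes from the linear predictor through the inverse link, e.g. for Poisson:
# mu = np.exp(X @ beta)
```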
The linear predictor is defined as $\eta = X\beta$, where $X$ is the matrix of predictors and $\beta$ the coefficients. What distinguishes GLM from the linear model is the link function that connects the linear predictor to the expected value of the response variable. The link function arises from the ML formulation and is chosen based on the distribution of the response variable: for example, in the case of the gaussian distribution the link function is the identity function, while for the Bernoulli distribution it's the logit function.

| LM | GLM |
|---|---|
| $\mu_i = X_i^T\beta$ | $g(\mu_i) = X_i^T\beta$ |
This link function has to be:

- monotonic (so that the relationship between the predictors and the response variable is preserved)
- differentiable (so that we can estimate the coefficients of the model)
- invertible (so that we can get back to the mean from the linear predictor)

In the case of binary regression, the response follows a Bernoulli distribution with some probability p; the most common link function is the logit function, $g(\mu_i) = log(\frac{\mu_i}{1-\mu_i}) = X_i^T\beta \iff \mu_i = \frac{e^{X_i^T\beta}}{1+e^{X_i^T\beta}} = \frac{1}{1+e^{-X_i^T\beta}}$.
| Distribution | Link function name | Mathematical expression | Prediction |
|---|---|---|---|
| Gaussian | Identity | $g(\mu_i) = \mu_i = X_i^T\beta$ | $\hat\mu_i = X_i^T\hat\beta$ |
| Bernoulli | Logit | $g(\mu_i) = log(\frac{\mu_i}{1-\mu_i}) = X_i^T\beta$ | $\hat\mu_i = \frac{1}{1+e^{-X_i^T\hat\beta}}$ |
| Poisson | Log | $g(\mu_i) = log(\mu_i) = X_i^T\beta$ | $\hat\mu_i = e^{X_i^T\hat\beta}$ |
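A hedged sketch of these link / inverse-link pairs and how predictions flow through them (the `links` dictionary and `predict` helper are illustrative, not this library's API):

```python
import numpy as np

# link g(mu) = eta                                inverse link mu = g^{-1}(eta)
links = {
    "gaussian": (lambda mu: mu,                     lambda eta: eta),                      # identity
    "bernoulli": (lambda mu: np.log(mu / (1 - mu)), lambda eta: 1 / (1 + np.exp(-eta))),   # logit
    "poisson": (lambda mu: np.log(mu),              lambda eta: np.exp(eta)),              # log
}

def predict(X, beta, family):
    """mu_hat = g^{-1}(X beta) for the chosen family."""
    _, inverse_link = links[family]
    return inverse_link(X @ beta)

X = np.array([[1.0, 0.5], [1.0, -1.2]])   # intercept + one predictor
beta = np.array([0.2, 1.5])
print(predict(X, beta, "bernoulli"))      # probabilities between 0 and 1
```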
In each model, the coefficients are estimated by minimizing the loss function. The linear model can use Ordinary Least Squares (OLS), while the logistic and poisson models use Maximum Likelihood Estimation (MLE): the coefficients are those that maximize the likelihood of the observed data given the model.

Another important concept in GLM is the loss function, a function that measures the difference between the predicted and actual values of the response variable. It is used to estimate the coefficients by minimizing this difference, and it's usually chosen based on the distribution of the response variable.

So our goal is to get the $\beta$ that optimizes the performance of the model, i.e. minimizes the loss function or maximizes the likelihood function (the loss function is the negative log likelihood).

An important aspect of building a GLM is to understand the statistical properties of the model, like residuals, deviance, and confidence intervals. When evaluating the performance of a model, we often care to see which predictors are significant and which are not; this is done by performing hypothesis tests on the coefficients of the model.

We can compare models and nested models by anova, which comes down to comparing the deviance of the models. The deviance is a measure of the goodness of fit of the model and is used to compare the performance of different models. Other measures of goodness of fit include the AIC and BIC, among others.
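As an illustration (using statsmodels rather than this library), deviance and AIC can be compared across two fitted GLMs like this, assuming synthetic binary data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x))))

X1 = sm.add_constant(x)          # model with the predictor
X0 = np.ones((len(y), 1))        # intercept-only model

fit1 = sm.GLM(y, X1, family=sm.families.Binomial()).fit()
fit0 = sm.GLM(y, X0, family=sm.families.Binomial()).fit()

# lower deviance / AIC indicates a better fit (AIC also penalizes model size)
print(fit1.deviance, fit0.deviance)
print(fit1.aic, fit0.aic)
```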
Motivation:

When validating the full rank assumption, if $p \gt n$ we can't estimate the coefficients ($X^TX$ is not invertible). Even if p is not greater than n but is close to it, the huge number of predictors increases the variance of the model and makes the statistical tests less powerful. There are variable selection algorithms for that, but with many predictors they quickly become impractical. So that is a problem.

Starting from a classical linear model, we can add a penalty term to the loss function to prevent overfitting; this is called regularization.

Classical:

$$ RSS = (y - X\beta)^T(y - X\beta)$$

$$ \hat \beta = (X^TX)^{-1}X^Ty$$

Penalized (ridge example):

$$ RSS(\lambda) = (y - X\beta)^T(y - X\beta) + \lambda \beta^T\beta$$

The ridge estimator minimizes this penalized RSS, so the solution is

$$\hat \beta^{ridge} = argmin_{\beta} RSS(\lambda) = argmin_{\beta} \sum_{i=1}^{n} (y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
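A minimal sketch of the ridge solution in closed form, $\hat \beta^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$, which follows from differentiating the penalized RSS (synthetic data, assumed centered/standardized; the intercept is typically left unpenalized):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X assumed centered/standardized, no intercept column)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)
y = y - y.mean()

print(ridge(X, y, lam=0.0))    # lambda = 0 recovers OLS
print(ridge(X, y, lam=10.0))   # coefficients shrunk toward zero
```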
Now the formulas mentioned above for linear models can be generalized by using the negative log likelihood instead of least squares.

Lasso: another type of regularization that uses the L1 norm of the coefficients as the penalty term (absolute values of the estimates). The lasso performs variable selection by setting some of the coefficients exactly to zero; in practice this is the main difference between the lasso and ridge regression (and is considered an advantage of the lasso).

Elastic net: a combination of lasso and ridge, with both the L1 and L2 norms in the penalty term (the two parts are weighted by $\alpha$):

$$ RSS(\lambda, \alpha) = (y - X\beta)^T(y - X\beta) + \lambda \alpha \sum_{j=1}^{p} |\beta_j| + \lambda (1-\alpha) \sum_{j=1}^{p} \beta_j^2$$

- if $\alpha = 0$, it's ridge
- if $\alpha = 1$, it's lasso
Choosing $\lambda$:

- Cross validation: split the data into training and validation sets, train the model on the training set, and evaluate its performance on the validation set. Repeat this process for different values of $\lambda$ and choose the value of $\lambda$ that gives the best performance on the validation set.
- The estimates will depend on this choice; we plot them as a function of $\lambda$ (the regularization path) and choose the one that gives the best performance. Notice how the coefficients are shrunk towards zero as $\lambda$ increases (see the sketch after this list).
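As referenced in the list above, a hedged sketch of a regularization path and a simple hold-out choice of $\lambda$, using the same closed form as the ridge sketch earlier (synthetic data, illustrative split):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=200)

train, val = slice(0, 150), slice(150, 200)
lambdas = np.logspace(-3, 3, 13)

path, val_errors = [], []
for lam in lambdas:
    # ridge closed form, fit on the training split only
    beta = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(5), X[train].T @ y[train])
    path.append(beta)                                          # one point of the regularization path
    val_errors.append(np.mean((y[val] - X[val] @ beta) ** 2))  # validation MSE for this lambda

best_lambda = lambdas[int(np.argmin(val_errors))]
print(best_lambda)
# np.array(path) plotted against lambdas shows the coefficients shrinking toward zero
```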
Some other important concepts in statistics

The idea of bias comes from this example: if we take the sample mean as an estimator of $\mu$, for some observations we will have $\hat \mu \gt \mu$ (overestimating) and for others $\hat \mu \lt \mu$ (underestimating), but averaging over large sets of observations we expect $\hat \mu = \mu$, so the estimator is unbiased: an "unbiased estimator does not systematically over- or under-estimate the true parameter" (ISLR). This property holds in least squares, where the coefficients are unbiased estimators of the true coefficients: estimating them on one dataset won't make our estimates exactly equal to the true values, but averaging over a large number of datasets will.

The covariance matrix is a square matrix that contains the variances of the variables on the diagonal and the covariances between the variables on the off-diagonal. It's a measure of how much two variables change together.
In mathematical formulation, for a random vector:

- $X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$, where $X_1$ and $X_2$ are random variables
- $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$, where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$
- $Cov(X_1,X_1) = Var(X_1) = \mathbb{E}[(X_1-\mu_1)(X_1-\mu_1)] = \mathbb{E}[(X_1-\mu_1)^2]$ is the variance of $X_1$ (the variance formula from a first statistics course); by definition it's the average of the squared differences between the values of $X_1$ and its mean, where a large variance means that the values of $X_1$ are spread out, far from the mean
- $Cov(X_1,X_2) = \mathbb{E}[(X_1-\mu_1)(X_2-\mu_2)]$ is the covariance between $X_1$ and $X_2$, a measure of how much $X_1$ and $X_2$ change together: if it's positive, then when $X_1$ is above its mean, $X_2$ tends to be above its mean too; if it's negative, then when $X_1$ is above its mean, $X_2$ tends to be below its mean
- The covariance formula above, when expanded into a matrix to include all the variables, gives the covariance matrix, also denoted $Var(X)$ (ESLR) or $\Sigma$ (ISLR): $\Sigma = \begin{bmatrix} Var(X_1) & Cov(X_1,X_2) \\ Cov(X_2,X_1) & Var(X_2) \end{bmatrix}$
- Let $A$ be a linear transformation matrix, then $Var(AX) = A\Sigma A^T$; this is a property of the covariance matrix, it's invariant under linear transformations (derivation: probabilitycourse.com)

![cov between 2 vars]
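A quick numerical check of the $Var(AX) = A\Sigma A^T$ property with numpy (the covariance values and the transformation $A$ are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(9)
# two correlated random variables, stacked as rows of X
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=100_000).T

Sigma = np.cov(X)            # empirical covariance matrix (variances on the diagonal)
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])   # a linear transformation

lhs = np.cov(A @ X)          # Var(AX) estimated from the transformed data
rhs = A @ Sigma @ A.T        # A Sigma A^T
print(np.round(lhs, 2))
print(np.round(rhs, 2))      # the two agree up to sampling noise
```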
A refresher on some distributions

- The gaussian distribution is characterized by the probability density function $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean and $\sigma^2$ is the variance. Here is a breakdown of this function because it's very nice to look at (source):
  - The $e^{-\frac{1}{2}(\cdot)}$ term gives the density its bell-curve shape: the exponent is negative so the curve decays away from the mean and is concave down between $\mu - \sigma$ and $\mu + \sigma$ (the 2 in the denominator places the inflection points exactly at $\mu \pm \sigma$), and the $e$ keeps the curve always positive.
  - The $(x-\mu)^2$ term is the exponent, centered around the mean $\mu$.
  - The $\sigma^2$ term in the denominator of the exponent is the variance; it controls the spread of the curve. Because the exponent is negative, the variance ends up directly proportional to the spread of the curve.
  - The $\frac{1}{\sqrt{2\pi\sigma^2}}$ term is the normalization constant: $\int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}}dx = \sqrt{2\pi\sigma^2}$, so multiplying by $\frac{1}{\sqrt{2\pi\sigma^2}}$ ensures that the area under the curve is one ($\int_{-\infty}^{\infty} f(x)dx = 1$) and the distribution is indeed a probability density function.

Variation table of the gaussian density:

| x | $-\infty$ | $\mu - \sigma$ | $\mu$ | $\mu + \sigma$ | $+\infty$ |
|---|---|---|---|---|---|
| f(x) | 0 | less than max | max | less than max | 0 |
| f'(x) | + | + | 0 | - | - |
| f''(x) | + | 0 | - | 0 | + |
- The binomial distribution is characterized by the probability mass function $f(y) = \binom{n}{y}p^y(1-p)^{n-y}$, where $n$ is the number of trials, $y$ is the number of successes, and $p$ is the probability of success. The binomial distribution is used to model the number of successes in a fixed number of trials.
- The poisson distribution is characterized by the probability mass function $f(y) = \frac{\lambda^y}{y!}e^{-\lambda}$, where $\lambda$ is the average number of events (and also the variance) in a fixed interval of time. The poisson distribution is used to model the number of events in a fixed interval of time.
Note on density vs mass:

- A probability density function describes the likelihood of a continuous random variable taking on a particular value; the area under the curve of a probability density function equals 1.
- A probability mass function describes the likelihood of a discrete random variable taking on a particular value; the sum of the probabilities of all possible values of a discrete random variable equals 1.
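A small sketch with scipy.stats showing the two facts above numerically (the grids and parameters are arbitrary):

```python
import numpy as np
from scipy import stats

# density (continuous): the area under the curve integrates to 1
x = np.linspace(-6, 6, 2001)
pdf = stats.norm.pdf(x, loc=0, scale=1)
print(pdf.sum() * (x[1] - x[0]))                              # ~= 1.0 (Riemann sum of the area)

# mass (discrete): the probabilities of all possible values sum to 1
print(stats.binom.pmf(np.arange(0, 11), n=10, p=0.3).sum())   # exactly 1.0
print(stats.poisson.pmf(np.arange(0, 100), mu=4).sum())       # ~= 1.0 (support truncated at 99)
```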
For any statistical test we have a random variable (the test statistic) following a known distribution; the question of interest here is whether the expectation of the parameter estimator is 0.

Motivation:

When we do multiple hypothesis tests, each test comes with its own chance of a type I error, so the more tests we run, the more false discoveries we make. So the problem: too many type I errors.
Some terminology to note:

- type I error: rejecting the null hypothesis when it's true $\iff FP \iff$ false discovery; $\alpha = P(reject\ H_0|H_0\ is\ true)$ = rate of false discoveries
- type II error: failing to reject a false null hypothesis $\iff FN$
- FWER: family-wise error rate, the probability of making at least one type I error in a family of tests, $FWER = P(V \ge 1)$, where $V$ is the number of false positives
- FDP: false discovery proportion, the proportion of false positives among the rejected hypotheses, $FDP = \frac{V}{max(R, 1)}$, where $R$ is the number of rejected hypotheses
- FDR: false discovery rate, the expected proportion of false positives among the rejected hypotheses, $FDR = E(\frac{V}{R}|R>0)$
- Confusion table:

| | Accept $H_0$ | Reject $H_0$ | total |
|---|---|---|---|
| $H_0$ true | U | V (type I = FP) | $m_0$ |
| $H_0$ false | T (type II = FN) | S | $m_1$ |
| total | W | R | m |
- tolerance: the proportion of false positives that we are willing to accept, $tolerance = \frac{V}{R}$
- power: the probability of rejecting a false null hypothesis, $power = 1 - P(type\ II\ error)$; a procedure is more powerful when it makes more true discoveries

Assuming the tests are independent,

$FWER = P(V \ge 1) = 1 - P(no\ type\ I\ errors) = 1 - (1-\alpha)^m$

e.g., with a type I error rate $\alpha = 0.05$, if we do 50 tests, the probability of making at least one type I error is $1-(1-0.05)^{50} \approx 0.92$ (a near 100% chance of making a type I error somewhere in the m tests). When we plot $1 - (1-0.05)^m$, we see that it increases very fast. Changing $\alpha$ doesn't really help, as the shape of the curve stays the same, so we need to correct for this.

![plot of the function]
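A tiny sketch of how fast $1-(1-\alpha)^m$ grows with the number of tests m (values match the example above):

```python
import numpy as np

alpha = 0.05
m = np.arange(1, 101)
fwer = 1 - (1 - alpha) ** m     # P(at least one type I error) under independence

print(round(fwer[49], 3))       # m = 50 tests  -> ~0.923
print(round(fwer[99], 3))       # m = 100 tests -> ~0.994
```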
Distinguish between different types of multiplicities:

- 1-2 sample problems with m different endpoints (m = p in linear regression): this means m statistical tests for m parameters
- k-sample problems with localized comparisons: k-1 tests if there is one control group, k(k-1)/2 if all pairwise comparisons are made
Controlling the FWER

Controlling the FWER means limiting the probability of making at least one type I error in a family of tests; this will not only decrease the number of false discoveries but also the number of true discoveries (very few $H_0$ will be rejected), making the procedure too conservative for large m ($ESLII$).

Bonferroni (single-step method):

Makes each test more stringent to keep $FWER \le \alpha$, so $\alpha/m$ becomes the new $\alpha$ for each test. It's single step because each test is performed at the predefined $\alpha/m$ level, regardless of the observations (same level for the m tests).

Remember $P(A \cup B) \le P(A) + P(B)$, so

$FWER = P(\bigcup_{j=1}^{m} \{type\ I\ error\ in\ test\ j\}) \le \sum_{j=1}^{m} P(type\ I\ error\ in\ test\ j) = \alpha/m + \alpha/m + ... + \alpha/m = \alpha$

Hence, $FWER \le \alpha$.
Holm (stepwise: step-down method):

Also guarantees $FWER \le \alpha$, BUT it's more powerful than Bonferroni (allows for more discoveries - at least as many as Bonferroni).

- sort the p-values from smallest to largest, $p_{(1)} \le p_{(2)} \le ... \le p_{(m)}$
- compare $p_{(j)}$ with $\alpha/(m-j+1)$
- reach the index $i$ where $p_{(i)} \gt \alpha/(m-i+1) \implies$ fail to reject
- all remaining (larger) p-values also fail to be rejected

It's called step-down because the denominator is decreasing in each step (starting with a large set of hypotheses and going down). It ensures that all tests that would be rejected by Bonferroni are also rejected by Holm. It is a strong control (i.e., it's conservative) but it's more powerful than Bonferroni.

![bonferroni & holm index vs sorted pval plot]
Hochberg (step-up method):

Also guarantees $FWER \le \alpha$ and is more powerful than Bonferroni & Holm; HOWEVER, its main drawback is that it relies on the strong assumption that the tests are independent (there is a cost for this power).

Same principle, but it starts from the largest p-value and goes down:

- sort the p-values from largest to smallest, $p_{(m)} \ge p_{(m-1)} \ge ... \ge p_{(1)}$
- compare $p_{(m)}$ with $\alpha$:
  - a. if $p_{(m)} \le \alpha$, reject all $H_0$
  - b. else compare $p_{(m-1)}$ with $\alpha/2$
  - c. continue until $p_{(m-i+1)} \le \alpha/i$
Controlling the FDR

Here we have a different guarantee: we want to control the proportion of false discoveries among the rejected hypotheses. As said previously, FWER control is too conservative for large m, and we might not get any discoveries at all. In some contexts (like GWAS), we want to get some discoveries and experiment with them, so we need a less conservative method; enter the FDR.

Again, by definition $FDR = E(\frac{V}{R}|R>0)$, where $V$ is the number of false positives and $R$ is the number of rejected hypotheses. We're controlling the rate of false discoveries, rather than the probability of making at least one error in m tests (FWER), to be less than $\alpha$ (we might take $\alpha$ to be something like 10 or 20%).

In a programming context $FDR=\frac{FP}{FP+TP}$ and $precision=\frac{TP}{TP+FP}$, so $FDR=1-precision$.

![plot comparing curves with FWER approaches]
Benjamini-Hochberg (step-up method):

- sort the p-values from smallest to largest, $p_{(1)} \le p_{(2)} \le ... \le p_{(m)}$
- find the largest index $i$ such that $p_{(i)} \le \frac{i}{m}\alpha$
- reject all $H_0^j$ for $j \le i$
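A hedged sketch of the Benjamini-Hochberg rule as a plain function (the function name and example p-values are illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Reject H0_(j) for all j <= i*, where i* is the largest i with p_(i) <= (i/m) * alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        i_star = np.max(np.nonzero(below)[0])   # largest sorted index passing the threshold
        reject[order[: i_star + 1]] = True      # reject everything up to and including it
    return reject

pvals = [0.001, 0.011, 0.03, 0.4, 0.6]
print(benjamini_hochberg(pvals, alpha=0.1))     # first three hypotheses rejected
```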
Sources:

- Elements of Statistical Learning textbook by Hastie, Tibshirani, and Friedman
- Introduction to Statistical Learning textbook by James, Witten, Hastie, and Tibshirani (hastie.su.domains/ISLR/)
- Introduction to Generalized Linear Models textbook by Dobson and Barnett
- Linear Models with R textbook by Faraway
- Data Science Methods and Statistical Learning - U Toronto by Prof. Samin Aref (particularly MT correction)
- DataMListic youtube channel, particularly for the gaussian distribution
- StatQuest youtube channel by Josh Starmer (amazingness)
- MIT 18.065: Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, lectures on youtube by Prof. Gilbert Strang, for linear algebra
- Generalized Linear Models (GLM's) youtube playlist by Meerkat Statistics

Visualizations drawn with ...

Library design and optimization is under construction.

Licensed by the MIT License.