Generalized linear models

Computational Mathematics and Statistics Camp University of Chicago September 2018

Generalized linear models

  • Flexible class of models to estimate linear regression for response variables that have error distribution models other than the normal distribution
  • Typically estimated via maximum likelihood estimation

Elements of a GLM

  1. Random component - conditional distribution of the response variable \(Y_i\) given the values of the predictor variables
  2. Linear predictor

    \[\eta_i = \beta_{0} + \beta_{1}X_{i1} + \beta_{2}X_{i2} + \dots + \beta_{k}X_{ik}\]

  3. Link function \(g(\cdot)\)

    \[g(\mu_i) = \eta_i = \beta_{0} + \beta_{1}X_{i1} + \beta_{2}X_{i2} + \dots + \beta_{k}X_{ik}\] \[\mu_i = g^{-1}(\eta_i) = g^{-1}(\beta_{0} + \beta_{1}X_{i1} + \beta_{2}X_{i2} + \dots + \beta_{k}X_{ik})\]

    • Transforms the expectation of the response variable, \(\mu_i \equiv E(Y_i)\) to the linear predictor
    • Canonical link functions

GLMs and ordinary least squares regression

  • Least squares regression

    \[\boldsymbol{\beta} = \mathbf{(X^{'}X)^{-1}X^{'}Y}\]

  • OLS as a special case of a GLM

GLMs and ordinary least squares regression

\[Pr(Y_i = y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(Y_i - \mu)^2}{2\sigma^2}\right]\]

Log-likelihood function

\[ \begin{aligned} \log L(\hat{\mu}, \hat{\sigma}^2 | Y) &= \log \prod_{i = 1}^{N}{\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(Y_i - \mu)^2}{2\sigma^2}\right]} \\ \log L(\hat{\mu}, \hat{\sigma}^2 | Y) &= \sum_{i=1}^{N}{\log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(Y_i - \mu)^2}{2\sigma^2}\right]\right)} \\ \log L(\hat{\mu}, \hat{\sigma}^2 | Y) &= -\frac{N}{2} \log(2\pi) - \left[ \sum_{i = 1}^{N} \log{\sigma^2 - \frac{1}{2\sigma^2}} (Y_i - \mu)^2 \right] \end{aligned} \]

Example data

Salaries of assistant professors
id salary
1 60
2 55
3 65
4 50
5 70

Example loglikelihood function

Adding predictor variables

\[ \begin{aligned} E(Y) &\equiv \mu = \beta_0 + \beta_{1}X_{i} \\ \mathrm{Var}(Y) &= \sigma^2 \end{aligned} \]

\[ \begin{aligned} \log L(\beta_0, \beta_1, \sigma^2 | Y) &= \log \prod_{i = 1}^{N}{\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(Y_i - \beta_0 - \beta_{1}X_{i})^2}{2\sigma^2}\right]} \\ \log L(\beta_0, \beta_1, \sigma^2 | Y) &= -\frac{N}{2} \log(2\pi) \\ &\quad - \left[ \sum_{i = 1}^{N} \log{\sigma^2 - \frac{1}{2\sigma^2}} (Y_i - \beta_0 - \beta_{1}X_{i})^2 \right] \end{aligned} \]

Connecting MLE estimator to OLS

  • Kernal of the log-likelihood

    \[-\sum_{i = 1}^{N} \log{\sigma^2 - \frac{1}{2\sigma^2}} (Y_i - \beta_0 - \beta_{1}X_{i})^2\]

  • Residual sum of squared errors

    \[RSS = \sum_{i = 1}^{N} (Y_i - \beta_0 - \beta_{1}X_{i})^2\]

GLM for linear regression

  1. The random component is the normal distribution:

    \[Pr(Y_i = y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[\frac{(Y_i - \mu)^2}{2\sigma^2}\right]\]

  2. The linear predictor is:

    \[\eta_{i} = \beta_0 + \beta_{1}X_{i}\]

  3. The link function is the identity function:

    \[\eta_i = \mu_i\]

Why we use the log-likelihood

GLMs and logistic regression

\[Pr(Y_i = y_i | \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{(1 - y_i)}\]

GLMs and logistic regression

\[\pi_i = \eta_i\]

  • In the linear context

    \[g(\pi_i) \equiv \eta_i = \beta_0 + \beta_{1}X_i\]

  • Logit link function

    \[g(\pi_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}}\]

GLMs and logistic regression

  • The random component is the Bernoulli distribution

    \[Pr(Y_i = y_i | \pi) = \pi_i^{y_i} (1 - \pi_i)^{(1 - y_i)}\]

  • The linear predictor is:

    \[\eta_i = \beta_0 + \beta_{1}X_i\]

  • The link function is the logit function:

    \[\eta_i = \log \left( \frac{\pi_i}{1 - \pi_i} \right)\]

GLMs and logistic regression

\[ \begin{aligned} L_i &= \left( \frac{e^{\eta_i}}{1 + e^{\eta_i}} \right) ^ {Y_i} \left[ 1 - \left( \frac{e^{\eta_i}}{1 + e^{\eta_i}} \right) \right]^{1 - Y_i} \\ L_i &= \left( \frac{e^{\beta_0 + \beta_{1}X_i}}{1 + e^{\beta_0 + \beta_{1}X_i}} \right) ^ {Y_i} \left[ 1 - \left( \frac{e^{\beta_0 + \beta_{1}X_i}}{1 + e^{\beta_0 + \beta_{1}X_i}} \right) \right]^{1 - Y_i} \\ L &= \prod_{i = 1}^{N} \left( \frac{e^{\beta_0 + \beta_{1}X_i}}{1 + e^{\beta_0 + \beta_{1}X_i}} \right) ^ {Y_i} \left[ 1 - \left( \frac{e^{\beta_0 + \beta_{1}X_i}}{1 + e^{\beta_0 + \beta_{1}X_i}} \right) \right]^{1 - Y_i} \\ \log L &= \sum_{i = 1}^{N} Y_i \log \left( \frac{e^{\beta_0 + \beta_{1}X_i}}{1 + e^{\beta_0 + \beta_{1}X_i}} \right) \\ &\quad + \sum_{i = 1}^{N}(1 - Y_i) \log \left[ 1 - \left( \frac{e^{\beta_0 + \beta_{1}X_i}}{1 + e^{\beta_0 + \beta_{1}X_i}} \right) \right] \end{aligned} \]

Logistic regression for more than two response classes

  • Roll-call vote outcomes
  • Binary response
    1. Yay
    2. Nay
  • Multiclass response
    1. Yay
    2. Nay
    3. Abstain
    4. Not present

Multinomial logistic regression

  • \(Y\) can take on any of \(m\) discrete values from \(1,2,\dots,m\)
  • Ordering does not matter
  • \(\pi_{ij}\) denotes the probability that the \(i\)th observation falls into the \(j\)th category of the response variable:

    \[\pi_{ij} \equiv \text{Pr}(Y_i = j), \text{ for } j = 1, \dots, m\]
  • For a model with \(k\) regressors, \(X_1, \dots, X_k\):

    \[\pi_{ij} = \frac{e^{[\gamma_{0j} + \gamma_{1j}X_{i1} + \dots + \gamma_{kj}X_{ik}]}}{1 + \sum_{l = 1}^{m-1} e^{[\gamma_{0l} + \gamma_{1l}X_{i1} + \dots + \gamma_{kl}X_{ik}]}}, \text{for } j = 1, \dots, m-1\]

    \[\pi_{im} = 1 - \sum_{i = 1}^{m-1} \pi_{ij}\]

    \[\sum_{j = 1}^{m} = 1\]

Multinomial logistic regression

\[\log \frac{\pi_{ij}}{\pi_{im}} = \gamma_{0j} + \gamma_{1j}X_{i1} + \dots + \gamma_{kj}X_{ik}, \text{for } j = 1, \dots, m\]

  • Dichotomous logit model

    \[\log \frac{\pi_{i1}}{\pi_{i2}} = \log \frac{\pi_{i1}}{1 - \pi_{i1}} = \gamma_{01} + \gamma_{11}X_{i1} + \dots + \gamma_{k1}X_{ik}\]

Applying multinomial logisitic regression

## Observations: 712
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3,...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <chr> "male", "female", "female", "female", "male", "mal...
## $ Age         <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0,...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        <dbl> 7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.1...
## $ Cabin       <chr> "", "C85", "", "C123", "", "E46", "", "", "", "G6"...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", ...
  • Port of embarkation
    • C = Cherbourg
    • Q = Queenstown
    • S = Southampton

Port of embarkation

## # weights:  9 (4 variable)
## initial  value 782.211950 
## iter  10 value 423.965155
## final  value 422.773331 
## converged
## Call:
## multinom(formula = Embarked ~ Fare, data = titanic_multi)
## 
## Coefficients:
##   (Intercept)    Fare
## C       -1.96  0.0126
## Q       -2.64 -0.0159
## 
## Std. Errors:
##   (Intercept)    Fare
## C       0.133 0.00202
## Q       0.278 0.01082
## 
## Residual Deviance: 846 
## AIC: 854
##   (Intercept)  Fare
## C      -14.78  6.21
## Q       -9.47 -1.47
##   (Intercept)     Fare
## C           0 5.24e-10
## Q           0 1.41e-01

Port of embarkation: odds

##   (Intercept)  Fare
## C      0.1407 1.013
## Q      0.0717 0.984

Port of embarkation: probability

Port of embarkation: cumulative probability

Ordinal logistic regression

  • More than two ordered response values
  • Latent continuous variable

    \[\zeta_i = \beta_{0} + \beta_{1}X_{i1} + \dots + \beta_{k}X_{ik} + \epsilon_i\]

  • Divide into \(m\) regions using \(m-1\) thresholds

    \[ Y_i = \begin{cases} 1 & \text{if } \zeta_{i} \leq \alpha_{1} \\ 2 & \text{if } \alpha_{1} \lt \zeta_{i} \leq \alpha_{2} \\ \vdots & \\ m - 1 & \text{if } \alpha_{m - 2} \lt \zeta_{i} \leq \alpha_{m - 1} \\ m & \text{if } \alpha_{m - 1} \lt \zeta_{i} \\ \end{cases} \]

  • Log-odds of increasing to the next level
  • Difference between levels is constant

Passenger class

## Call:
## MASS::polr(formula = Pclass ~ Fare, data = titanic_ord, Hess = TRUE)
## 
## Coefficients:
##        Value Std. Error t value
## Fare -0.0958    0.00682   -14.1
## 
## Intercepts:
##     Value   Std. Error t value
## 1|2  -4.064   0.225    -18.102
## 2|3  -2.024   0.153    -13.257
## 
## Residual Deviance: 970.38 
## AIC: 976.38

Passenger class

Poisson distribution

\[P(Y_i = y_i | \mu) = \frac{\mu^{y_i} e^{-\mu}}{y_i!}\]

GLM and Poisson regression

  • The random component is the Poisson distribution:

    \[P(Y_i = y_i | \mu) = \frac{\mu^{k} e^{-y_i}}{y_i!}\]

  • The linear predictor is

    \[\eta_i = \beta_0 + \beta_{1}X_i\]

  • The canonical link function for the Poisson distribution is the log function

    \[\eta_i = \log(\mu)\]

  • Estimation:

    \[P(Y_i = y_i | \beta_0, \beta_1) = \frac{\log(\beta_0 + \beta_{1}X_i)^{k} e^{-y_i}}{y_i!}\]

Internal armed conflict in Africa

## Observations: 650
## Variables: 12
## $ ccode     <dbl> 404, 404, 404, 404, 404, 404, 404, 404, 404, 404, 40...
## $ year      <dbl> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003...
## $ cabbr     <chr> "", "", "GNB", "", "GNB", "", "GNB", "", "", "", "",...
## $ INTERNAL  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ latitude  <dbl> 12.08, 12.08, 12.08, 12.08, 12.08, 12.08, 12.08, 12....
## $ longitude <dbl> -14.6, -14.6, -14.6, -14.6, -14.6, -14.6, -14.6, -14...
## $ literacy  <dbl> 44.0, 42.5, 41.0, 39.5, 38.0, 40.0, 42.0, 44.0, 46.0...
## $ refugees  <dbl> 15400, 15400, 16000, 6600, 7100, 7600, 7300, 7600, 7...
## $ lnRefs    <dbl> 9.64, 9.64, 9.68, 8.79, 8.87, 8.94, 8.90, 8.94, 8.94...
## $ lnGDP     <dbl> 6.45, 6.53, 6.57, 6.21, 6.26, 6.30, 6.27, 6.17, 6.13...
## $ lnTrade   <dbl> 3.85, 3.75, 4.11, 3.92, 4.21, 4.42, 4.52, 4.41, 4.35...
## $ PolityLag <dbl> NA, 5, 5, 5, 0, 3, 5, 5, 5, -1, -1, 6, 6, NA, -5, -5...
  • Count of internal armed conflicts
  • Measures of latitude and longitude
  • Adult literacy rate
  • Number of refugees living in the country
  • Lagged natural log of GDP and trade (in constant dollars)
  • Lagged polity score (ranging from -10 to +10, with 10 = democracy)

Internal armed conflict in Africa

Internal armed conflict in Africa

## 
## Call:
## glm(formula = INTERNAL ~ literacy, family = "poisson", data = africa)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.171  -0.684  -0.511  -0.411   2.808  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.000277   0.270673    0.00        1    
## literacy    -0.029070   0.005083   -5.72  1.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 405.45  on 563  degrees of freedom
## Residual deviance: 372.10  on 562  degrees of freedom
##   (86 observations deleted due to missingness)
## AIC: 584.2
## 
## Number of Fisher Scoring iterations: 6

Interpreting estimated parameters

Interpreting estimated parameters

## (Intercept)    literacy 
##       1.000       0.971

Multiple predictors

## 
## Call:
## glm(formula = INTERNAL ~ literacy + PolityLag, family = "poisson", 
##     data = africa)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.380  -0.677  -0.517  -0.342   2.803  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.11181    0.28678    0.39    0.697    
## literacy    -0.03063    0.00539   -5.68  1.4e-08 ***
## PolityLag   -0.04733    0.01946   -2.43    0.015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 380.81  on 520  degrees of freedom
## Residual deviance: 341.96  on 518  degrees of freedom
##   (129 observations deleted due to missingness)
## AIC: 544.1
## 
## Number of Fisher Scoring iterations: 6

Over or underdispersion

  • Poisson distribution assumes mean and variance are identical
  • Violation of that assumption
  • Test for over/under dispersion

    \[V(Y_i | \eta_i) = \phi \mu_i\]

Over or underdispersion

## 
## Call:
## glm(formula = INTERNAL ~ literacy, family = "quasipoisson", data = africa)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.171  -0.684  -0.511  -0.411   2.808  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.000277   0.263033    0.00        1    
## literacy    -0.029070   0.004940   -5.89  6.8e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.944)
## 
##     Null deviance: 405.45  on 563  degrees of freedom
## Residual deviance: 372.10  on 562  degrees of freedom
##   (86 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Over or underdispersion

## 
## Call:
## glm(formula = INTERNAL ~ literacy + PolityLag, family = "quasipoisson", 
##     data = africa)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.380  -0.677  -0.517  -0.342   2.803  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.11181    0.27866    0.40    0.688    
## literacy    -0.03063    0.00524   -5.84    9e-09 ***
## PolityLag   -0.04733    0.01891   -2.50    0.013 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.944)
## 
##     Null deviance: 380.81  on 520  degrees of freedom
## Residual deviance: 341.96  on 518  degrees of freedom
##   (129 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6