Bernoulli and Binomial Models

In the last tutorial, we learned how to program and estimate linear models in Stan. In this tutorial, we'll learn how to estimate binary outcome models, commonly referred to as Logit and Probit models.

Let's get right into it.

Logit with the Bernoulli Distribution

Import Data

This time, we'll use a data set on graduate school applicants. This data includes each applicant's GRE score, GPA, the rank of their undergraduate school, and whether or not they were admitted to graduate school. Our goal is to predict the probability that a student will be admitted.

gre.data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

library(tidyverse)
glimpse(gre.data)

We could predict whether or not a student is admitted with a Gaussian likelihood function, but then we would be assuming that the outcome is continuous and that both positive and negative values are possible. These are bad assumptions: this data is generated by a process that only produces discrete outcomes of 0 or 1. Each application either results in an admit or a reject; values in between are impossible. This means that we'll need a different likelihood function, and the Bernoulli distribution seems like a good fit.

The Bernoulli distribution is a probability distribution for a random variable that takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$. A quick look at the distribution shows that it only produces values of 0 and 1.

plot(rbernoulli(100, p = 0.5))
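
Note that rbernoulli() comes from purrr and has been deprecated in recent versions of the package. If it's unavailable, base R's rbinom() with size = 1 draws from the same distribution:

plot(rbinom(100, size = 1, prob = 0.5))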

You're probably used to using a binomial likelihood for Logit and Probit models, but the binomial distribution is really meant to describe aggregated counts: it includes an additional parameter for the number of trials that occurred. We'll see an example of this later. The Bernoulli distribution is equivalent to a binomial distribution with exactly 1 trial.
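
We can check this equivalence numerically: a binomial with one trial puts the same probability on 0 and 1 as the Bernoulli. A quick sketch in R, using base R's binomial mass function dbinom():

p <- 0.3
dbinom(1, size = 1, prob = p) # probability of a 1: equals p = 0.3
dbinom(0, size = 1, prob = p) # probability of a 0: equals 1 - p = 0.7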

Build a Logit Model

Here is the model that we want to estimate in Stan:

\[ \begin{aligned} \text{admit} &\sim \text{Bernoulli}(p) \\ \text{Logit}(p) &= \alpha + \beta_1 \text{GRE Score} + \beta_2 \text{GPA} + \beta_3 \text{Undergrad Ranking} \end{aligned} \]

Here we are interested in estimating the probability that a given student is admitted, $p$, using their GRE score, their GPA, and the ranking of their undergraduate school. Notice that we have an additional function in this model: the Logit() function.

This function is known as a "link" function because it maps probabilities, which live on the 0-1 interval, onto the entire real line where the linear predictor lives. Its inverse sends the linear predictor back to the 0-1 interval, which prevents the model from predicting outcomes above 1 or below 0. When we built our linear model with the normal distribution in the last tutorial, we didn't need a link function because we were estimating the mean of a continuous outcome (technically, we used the identity link function).
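
To build intuition for the link, note that R's qlogis() is the logit and plogis() is its inverse: plogis() squashes any real number into the 0-1 range, and qlogis() stretches a probability onto the whole real line. A quick sketch:

plogis(c(-5, 0, 5))        # about 0.007, 0.500, 0.993 -- always between 0 and 1
qlogis(c(0.01, 0.5, 0.99)) # about -4.6, 0.0, 4.6 -- any real value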

Let's load CmdStanR,

library(cmdstanr)
register_knitr_engine(override = FALSE) # register cmdstanr with knitr so we can use it with R Markdown

then program the model:

data {
  int n; //number of observations in the data
  int<lower = 0, upper = 1> admit[n]; //integer of length n for admission decision
  vector[n] gre; //vector of length n for GRE scores
  vector[n] gpa; //vector of length n for GPA
  vector[n] ranking; //vector of length n for school ranking
}

parameters {
  real alpha; //intercept parameter
  vector[3] beta; //vector of coefficients
}

model {
  //linear predictor
  vector[n] p;
  
  //linear equation
  p = alpha + beta[1] * gre + beta[2] * gpa + beta[3] * ranking;
  
  //likelihood and link function
  admit ~ bernoulli_logit(p);
}
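
If you're knitting with the cmdstanr knitr engine, a Stan chunk like the one above compiles to the model object named in its output.var chunk option; in what follows we'll assume that object is called logit.model. If you're working outside R Markdown, you can compile the same program yourself. Here we assume the code is saved as logit_model.stan (the file name is just for illustration):

logit.model <- cmdstan_model("logit_model.stan")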

Now let's prepare the data and estimate the model.

library(tidybayes)

model.data <- 
  gre.data %>%
  rename(ranking = rank) %>%
  compose_data()

# sample
logit.fit <- logit.model$sample(data = model.data)

# print results
print(logit.fit)
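
Before interpreting anything, it's worth a quick convergence check. CmdStanR's summary() method reports R-hat and effective sample sizes alongside the estimates:

# R-hat near 1 and large effective sample sizes suggest the chains mixed well
logit.fit$summary()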

Here are the interpretations of each coefficient. Notice that they read a bit differently since we are using a Logit link function: each coefficient is a difference in log-odds rather than in probability. We can "undo" the link function with the inverse logit (plogis() in R) to convert estimates back to the probability scale.

  • alpha: For a student with a 0 on the GRE, a GPA of 0, and an undergraduate school ranking of 0, the expected admission probability is $\text{logit}^{-1}(-3.51)$, which we can calculate in R with plogis(-3.51) * 100, about 2.9 percent.

  • beta[1]: Comparing two applicants who differ by 1 point on the GRE, the model predicts a positive difference in the log-odds of admission. The inverse logit of the coefficient, plogis(beta[1]), is about 0.50, only just above one half, so the per-point effect is positive but tiny.

  • beta[2]: Comparing two applicants who differ by 1 point in their GPA, the model predicts a positive difference in the log-odds of admission; plogis(beta[2]) is about 0.68, well above one half.

  • beta[3]: Comparing two applicants who differ by 1 point in their undergrad school ranking, the model predicts a negative difference in the log-odds of admission; plogis(beta[3]) is about 0.36, below one half, so a larger (worse) ranking number lowers the odds of admission.
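
A handy back-of-the-envelope tool here is the "divide by 4" rule: a logit coefficient divided by 4 is an upper bound on the change in probability for a one-unit change in that predictor. A sketch with an illustrative value (substitute your own estimate of beta[2]):

beta_gpa <- 0.75 # hypothetical GPA coefficient on the log-odds scale
beta_gpa / 4     # at most about a 0.19 increase in admission probability per GPA point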

When dealing with logit models, plots can go a long way towards helping us understand the results. We'll plot the relationship between admission and GPA (holding GRE and ranking at zero).

library(rstan)
# convert the CmdStanR output to a stanfit object so we can extract draws
logit.stanfit <- read_stan_csv(logit.fit$output_files())
logit.coefs <- 
  as.data.frame(logit.stanfit, pars = c("alpha", "beta[1]", "beta[2]", "beta[3]")) %>%
  summarise(alpha = mean(alpha),
            beta_1 = mean(`beta[1]`),
            beta_2 = mean(`beta[2]`),
            beta_3 = mean(`beta[3]`))

ggplot(data = gre.data, aes(x = gpa, y = admit)) +
  geom_point(position = position_jitter(width = 0, height = 0.05), shape = 1) +
  stat_function(fun = function(x) plogis(logit.coefs$alpha + logit.coefs$beta_2 * x))

We can also add uncertainty to the plot:

plot <- 
  ggplot(data = gre.data, aes(x = gpa, y = admit)) +
  geom_point(position = position_jitter(width = 0, height = 0.05), shape = 1)

sims <- as.data.frame(logit.stanfit)
sims <-
  sims %>%
  mutate(n = row_number()) %>%
  sample_n(size = 100) # select a random sample of 100 draws from the posterior

# column 1 of sims is alpha and column 3 is beta[2], the GPA coefficient
lines <- 
  purrr::map(1:100, function(i) stat_function(fun = function(x) plogis(sims[i, 1] + sims[i, 3] * x), 
                                              size = 0.1, color = "gray"))

plot <- plot + lines
plot <- 
  plot + stat_function(fun = function(x) plogis(logit.coefs$alpha + logit.coefs$beta_2 * x))
plot

Picking Priors for Logit Models

When we are thinking about priors for logit models (or any generalized linear model), it's important to use the inverse link function to convert back to the probability scale. With the Gaussian model, we can just plot our priors "as-is", but with logit models we need to "undo" the link function: what is a weak prior on the linear scale might be really informative on the logit scale. To illustrate just how different priors can look, let's start by using standard weakly informative priors for a normal distribution.

# expectations for the average admission rate
sample.intercept <- rnorm(1000, mean = 0, sd = 10)
plot(density(plogis(sample.intercept)))

With this prior, the model thinks that applicants either never get accepted or always get accepted. This is extreme, and we can do better:

sample.intercept <- rnorm(1000, mean = 0, sd = 1.5)
plot(density(plogis(sample.intercept)))

This is still wide, but already a lot more reasonable. Let's simulate the rest of the priors.

# expectations for the average admission rate (around 27%)
sample.intercept <- rnorm(1000, mean = -1, sd = 1.5)
plot(density(plogis(sample.intercept)))

# expectations for the effect of GRE on admission
sample.gre <- rnorm(1000, mean = 0.5, sd = 1)
plot(density(plogis(sample.gre)))

# prior predictive simulation for admission given the priors
prior_admit <- rbernoulli(1000, plogis(sample.gre + sample.intercept))
plot(density(as.numeric(prior_admit)))
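
Since the prior draws are binary, the density plot is hard to read; the mean of the draws, i.e. the implied acceptance rate, is a more direct summary:

mean(prior_admit) # share of simulated applicants admitted under the priors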

Here the model expects that more people are rejected than are accepted. Now we can add these priors into the model:

data {
  int n; //number of observations in the data
  int<lower = 0, upper = 1> admit[n]; //integer of length n for admission decision
  vector[n] gre; //vector of length n for GRE scores
  vector[n] gpa; //vector of length n for GPA
  vector[n] ranking; //vector of length n for school ranking
}

parameters {
  real alpha; //intercept parameter
  vector[3] beta; //vector of coefficients
}

model {
  //linear predictor
  vector[n] p;
  
  //linear equation
  p = alpha + beta[1] * gre + beta[2] * gpa + beta[3] * ranking;
  
  //prior expectations
  alpha ~ normal(-1, 1.5);
  beta ~ normal(0.5, 1.0);
  
  //likelihood and link function
  admit ~ bernoulli_logit(p);
}

and re-estimate the model:

# sample
logit.fit <- logit.model$sample(data = model.data)

# print results
print(logit.fit)

Finally, we can add in posterior predictive draws in the generated quantities block like this:

data {
  int n; //number of observations in the data
  int<lower = 0, upper = 1> admit[n]; //integer of length n for admission decision
  vector[n] gre; //vector of length n for GRE scores
  vector[n] gpa; //vector of length n for GPA
  vector[n] ranking; //vector of length n for school ranking
}

parameters {
  real alpha; //intercept parameter
  vector[3] beta; //vector of coefficients
}

model {
  //linear predictor
  vector[n] p;
  
  //linear equation
  p = alpha + beta[1] * gre + beta[2] * gpa + beta[3] * ranking;
  
  //prior expectations
  alpha ~ normal(-1, 1.5);
  beta ~ normal(0.5, 1.0);
  
  //likelihood and link function
  admit ~ bernoulli_logit(p);
}

generated quantities {
  //replications for the posterior predictive distribution
  int y_rep[n] = bernoulli_logit_rng(alpha + beta[1] * gre + beta[2] * gpa + 
    beta[3] * ranking);
}

logit.fit <- logit.model$sample(data = model.data)

print(logit.fit)

stanfit <- read_stan_csv(logit.fit$output_files())

# extract the fitted values
y.rep <- rstan::extract(stanfit)[["y_rep"]]
y <- gre.data$admit

library(bayesplot)
ppc_ecdf_overlay(y, y.rep[sample(nrow(y.rep), 100), ], discrete = TRUE)
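
For a binary outcome, bayesplot's ppc_bars() is another natural check, comparing the observed counts of 0s and 1s to the counts in the replicated data:

ppc_bars(y, y.rep[sample(nrow(y.rep), 100), ])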

Probit Model with the Bernoulli Distribution

We can change the logit link to a probit link with the Phi() function in Stan, which is the standard normal cumulative distribution function:

data {
  int n; //number of observations in the data
  int<lower = 0, upper = 1> admit[n]; //integer of length n for admission decision
  vector[n] gre; //vector of length n for GRE scores
  vector[n] gpa; //vector of length n for GPA
  vector[n] ranking; //vector of length n for school ranking
}

parameters {
  real alpha; //intercept parameter
  vector[3] beta; //vector of coefficients
}

model {
  //linear model
  vector[n] p;
  p = alpha + beta[1] * gre + beta[2] * gpa + beta[3] * ranking;
  
  //likelihood and link function
  admit ~ bernoulli(Phi(p));
}
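
One practical note: Stan also provides Phi_approx(), a faster logistic approximation to the standard normal CDF. Swapping it into the likelihood line can speed up sampling at a small cost in accuracy:

admit ~ bernoulli(Phi_approx(p));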

We already have the data ready, so we can get right to sampling from the model.

probit.fit <- probit.model$sample(data = model.data)

print(probit.fit)

To "undue" the probit link function, we can use prnom() in R. After doing this, we see that the coefficient estimates are very similar to the logit link function.

Aggregated Events

When we're dealing with aggregated 0-1 data, counts of successes out of some number of trials, we need to use the binomial distribution.

For an example of this, we'll use a data set on graduate admissions to UC Berkeley.

admit.data <- as.data.frame(UCBAdmissions)

glimpse(admit.data)

Here we have the total number of applicants admitted and rejected by gender and department. Let's use this data to test for gender discrimination in admissions. To do this, we'll build a model with a binomial likelihood function:

data {
  int n; 
  int admit[n];
  int applications[n]; //number of applications for each outcome
  vector[n] gender;
}

parameters {
  real alpha;
  vector[1] beta;
}

model {
  //linear model
  vector[n] p;
  p = alpha + beta[1] * gender;
  
  //likelihood and link function
  admit ~ binomial_logit(applications, p);
}

Notice the binomial likelihood involves an additional parameter: the number of trials for each outcome. This is captured by the applications variable.
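
To see the aggregation concretely: summing independent Bernoulli trials gives the same distribution as a single binomial draw with that many trials. A quick sketch:

set.seed(1)
sum(rbinom(20, size = 1, prob = 0.3)) # 20 Bernoulli trials, then count the successes
rbinom(1, size = 20, prob = 0.3)      # one binomial draw with 20 trials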

Each row of admit.data holds either an admitted count or a rejected count, so we first pivot those counts into separate columns, compute the total number of applications, and code gender as an indicator (here, 1 = Female):

model.data <- 
  admit.data %>%
  pivot_wider(names_from = Admit, values_from = Freq) %>%
  rename(admit = Admitted) %>%
  mutate(applications = admit + Rejected, # total applications per gender-department row
         gender = if_else(Gender == "Female", 1, 0)) %>%
  select(admit, applications, gender) %>%
  compose_data()

# estimate
logit.fit <- binomial.logit.model$sample(data = model.data)

# summarize results
print(logit.fit)