Data Analysis and Reporting Results

With a cleaned (tidy) dataset ready to go and with some detailed exploratory data analysis out of the way, we’re ready to test hypotheses. Rather than focus on the details of a specific programming language, this session will help you build your statistical modeling skills. We’ll talk about how to build models for different types of data and show how each model can be implemented in R or Python.

To get started, let’s load the clean data. You can do this with R or Python. I’m going to load it in R and call it up in Python when I need to (there’s no need to load it with both languages).

library(tidyverse)
flight_data <- read_csv("Documents/my_project/flights_cleaned.csv")

The General Linear Model

During our data exploration we investigated several interesting questions, but when we move on to modeling we want to investigate the relationships between variables more precisely. That’s where statistics comes in. Statistics allows us to talk about relationships between variables in terms of probability, which helps us quantify our uncertainty about any specific answer we might provide.

We explored patterns in the flight data with visual methods; now we want to build statistical models to test those patterns. Let’s start simple. Are planes that depart late more likely to arrive late? Let’s reproduce the plot we made between these variables last time, using Matplotlib this time instead of ggplot2:

import matplotlib.pyplot as plt
plt.scatter(r.flight_data["dep_delay"], r.flight_data["arr_delay"])
plt.xlabel("Departure Delays")
plt.ylabel("Arrival Delays")
plt.show()

plt.clf()

To quantify this relationship, we can use a regression model. Let’s start with linear regression. Linear regression is the first model most students learn, and they usually learn to write it like this:

\[ y_i = \alpha + \beta x_i + \epsilon_i \]

where \(y_i\) is the outcome variable, \(\alpha\) is the estimated intercept, \(\beta\) is the estimated slope, \(x_i\) is a column of data, and \(\epsilon_i\) is the error term, which measures the difference between the observed and predicted values for each row of data.
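If the notation feels abstract, a quick simulation can make it concrete. Here is a minimal sketch (the parameter values are arbitrary picks for illustration) that generates data from this exact model and recovers \(\alpha\) and \(\beta\):

# simulate data from y_i = alpha + beta * x_i + epsilon_i
set.seed(42)
alpha <- 2    # true intercept (arbitrary)
beta <- 0.5   # true slope (arbitrary)
x <- rnorm(500)
y <- alpha + beta * x + rnorm(500)  # epsilon_i ~ N(0, 1)

# the fitted coefficients should land close to 2 and 0.5
coef(lm(y ~ x))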

It’s not wrong to write your model in this textbook form, but it hides a lot of the assumptions you are making. For example, this model assumes that the errors are normally distributed. One way to make that assumption explicit is:

\[ \begin{aligned} y_i &= \alpha + \beta x_i + \epsilon_i \\ \epsilon_i &\sim N(0, \sigma^2) \end{aligned} \]

This is better because we can see that we are assuming the errors are normally distributed with a mean of 0 and a constant variance (homoscedastic). However, writing our models this way really only works for normally distributed data. An even better, more general way to write this model is:

\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta x_i \end{aligned} \]

This tells us that our model assumes the data are normally distributed with a constant variance, and it shows the predictors we are using to predict the mean. Cool, but how do we implement this in software? Let’s start with R:

fit_r <- glm(formula = arr_delay ~ dep_delay,
             family = gaussian(link = "identity"),
             data = flight_data)

Let’s break this down. We’re using the glm() function, which stands for generalized linear model, with a Gaussian family (the normal distribution) and an “identity” link function, which we’ll talk more about later. Finally, we feed in the data with data =.
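As an aside, a Gaussian GLM with an identity link is mathematically equivalent to ordinary least squares, so lm() should reproduce the same coefficient estimates. A quick sanity check:

# a gaussian/identity GLM is just ordinary least squares
fit_lm <- lm(arr_delay ~ dep_delay, data = flight_data)
coef(fit_lm)  # should match coef(fit_r)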

Now let’s see this in Python:

import statsmodels.formula.api as smf 
import statsmodels.api as sm

fit_py = smf.glm(formula = "arr_delay ~ dep_delay", 
                 family = sm.families.Gaussian(sm.families.links.identity()),
                 data = r.flight_data)

fit_py_fit = fit_py.fit()

We can view the results in R like this:

summary(fit_r)
## 
## Call:
## glm(formula = arr_delay ~ dep_delay, family = gaussian(link = "identity"), 
##     data = flight_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -107.587   -11.005    -1.883     8.938   201.938  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.8994935  0.0330195  -178.7   <2e-16 ***
## dep_delay    1.0190929  0.0007864  1295.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 324.9891)
## 
##     Null deviance: 652114033  on 327345  degrees of freedom
## Residual deviance: 106383218  on 327344  degrees of freedom
## AIC: 2822273
## 
## Number of Fisher Scoring iterations: 2

and in Python like this:

print(fit_py_fit.summary())
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:              arr_delay   No. Observations:               327346
## Model:                            GLM   Df Residuals:                   327344
## Model Family:                Gaussian   Df Model:                            1
## Link Function:               identity   Scale:                          324.99
## Method:                          IRLS   Log-Likelihood:            -1.4111e+06
## Date:                Wed, 02 Feb 2022   Deviance:                   1.0638e+08
## Time:                        11:59:04   Pearson chi2:                 1.06e+08
## No. Iterations:                     3   Pseudo R-squ. (CS):             0.9941
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     -5.8995      0.033   -178.667      0.000      -5.964      -5.835
## dep_delay      1.0191      0.001   1295.850      0.000       1.018       1.021
## ==============================================================================

The normal distribution is a good choice when we’re modeling data that is continuous and can theoretically range from \(-\infty\) to \(\infty\). That fits a lot of scenarios, but not all of them.1 One of the most common cases where the normal distribution isn’t great is when we are trying to model a binary outcome.

Generalizing the Linear Model

In order to generalize our linear model to model any type of data, we need two components: a likelihood function and a link function.

What is a likelihood function?

Likelihood functions are essentially probability models of our data: they describe the data generating process. Technically, the likelihood tells us how probable the observed data are under a given set of parameter values, and fitting a model means finding the parameter values that maximize it. That’s technical, but just remember that one of the main goals of statistics is to build models that can make accurate predictions and describe data. We do this by choosing likelihood functions that describe the observed data.
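To make that idea tangible, here is a small sketch with made-up numbers. It computes the log-likelihood of some normally distributed data under two candidate means; the candidate closer to the truth scores higher:

# simulate data with a known mean, then score two candidate means
set.seed(42)
y <- rnorm(100, mean = 5, sd = 2)

sum(dnorm(y, mean = 5, sd = 2, log = TRUE))  # candidate near the truth: higher log-likelihood
sum(dnorm(y, mean = 0, sd = 2, log = TRUE))  # candidate far from the truth: lower log-likelihood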

Putting it All Together

Our models almost always use the same linear structure (\(\mu_i = \alpha + \beta x_i\)), but we can choose any likelihood and link functions that we need to best match the data we are modeling. With these pieces we can generalize our linear equation to model any type of data we want, hence the generalized linear model.

Binomial Regression

For example, let’s suppose that we wanted to build a model to understand why delays happen. Maybe planes preparing to travel shorter distances are more likely to get delayed, since shorter travel times mean faster turnaround times. Let’s generate a binary indicator for whether a flight was delayed that will serve as the outcome variable:

flight_data <- flight_data %>% 
  mutate(delay_dum = ifelse(dep_delay > 0, yes = 1, no = 0))

Which probability distribution would best match this data generating process? Here we have a 0-1 outcome, so we need a discrete distribution that produces values of either 0 or 1. That leaves two questions to answer:

  1. What likelihood function (probability distribution) should we use?
  2. What link function should we use?

The binomial distribution is a good choice for a likelihood in this case because, with a single trial, it is a distribution over the discrete values 0 and 1 (this special case is also known as the Bernoulli distribution). So far, this means that our equation looks like this:

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ p_i &= \alpha + \beta x_i \end{aligned} \]

Let’s plot a binomial distribution to get an idea of what it looks like. Suppose we flipped a coin 3 times. Here is a visual (using the binomial distribution) of the probability of getting 0, 1, 2, and 3 heads:

library(tidyverse)

binomial_data <- tibble(
  # number of trials
  n = 3,
  
  # probability of success on a given trial
  p = 0.5,
  
  # range of successes (heads)
  s = 0:3,
  
  # probability mass
  prob = dbinom(x = s, size = n, prob = p)
)

ggplot(data = binomial_data, aes(x = s, y = prob)) +
  geom_col() +
  labs(x = "# of Successes", y = "Probability",
       title = "The Binomial Distribution")

Now, what if we flip the coin 100 times?

binomial_data <- tibble(
  # number of trials
  n = 100,
  
  # probability of success on a given trial
  p = 0.5,
  
  # range of successes (heads)
  s = 0:100,
  
  # probability mass
  prob = dbinom(x = s, size = n, prob = p)
)

ggplot(data = binomial_data, aes(x = s, y = prob)) +
  geom_col() +
  labs(x = "# of Successes", y = "Probability",
       title = "The Binomial Distribution")

Ok awesome, the binomial distribution seems like a good choice. Here’s the problem though: the linear equation (\(p_i = \alpha + \beta x_i\)) can give us just about any value, even though we only want values between 0 and 1. In other words, we’re using a probability distribution that produces 0’s and 1’s, but there’s nothing forcing our model’s predictions for \(p_i\) to stay between 0 and 1. That’s where the link function comes in.

Remember how we entered family = gaussian(link = "identity") for the linear regression? When you estimate a linear regression, you’re technically using the identity link function, which basically means that you aren’t transforming the linear predictor at all. When our outcome is binary, one link function we can use is the logit link function, which keeps \(p_i\) between 0 and 1. Mathematically, the logit link function is defined as:

\[ \text{logit}(p_i) = \ln \left(\frac{p_i}{1 - p_i}\right) \]

Because we model the logit of \(p_i\) rather than \(p_i\) itself, inverting the link function always returns a prediction between 0 and 1.
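You can check this directly: plogis() computes the inverse of the logit in R, and it squashes any value of the linear predictor into the 0-1 range. A quick sketch:

# the inverse logit maps any real number into (0, 1)
plogis(c(-10, 0, 10))
# roughly 0.0000454, 0.5, and 0.9999546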

Let’s see the link function in action:

ggplot(data = tibble(x = c(0, 1)), aes(x = x)) +
  stat_function(fun = function(x) log(x / (1 - x)), 
                geom = "line") +
  coord_flip() +
  labs(y = "x", x = "logistic(x)", title = "Plotting the Logit Link Function")

Alright, let’s fit the model already!

In R:

binom_fit_r <- glm(formula = delay_dum ~ distance,
                   family = binomial(link = "logit"),
                   data = flight_data)
summary(binom_fit_r)
## 
## Call:
## glm(formula = delay_dum ~ distance, family = binomial(link = "logit"), 
##     data = flight_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1526  -0.9930  -0.9697   1.3709   1.4147  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.503e-01  6.255e-03  -87.98   <2e-16 ***
## distance     9.865e-05  4.839e-06   20.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 437896  on 327345  degrees of freedom
## Residual deviance: 437481  on 327344  degrees of freedom
## AIC: 437485
## 
## Number of Fisher Scoring iterations: 4

In Python:

binom_py = smf.glm(formula = "delay_dum ~ distance", 
                   family = sm.families.Binomial(sm.families.links.logit()),
                   data = r.flight_data)

binom_fit_py = binom_py.fit()
print(binom_fit_py.summary())
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:              delay_dum   No. Observations:               327346
## Model:                            GLM   Df Residuals:                   327344
## Model Family:                Binomial   Df Model:                            1
## Link Function:                  logit   Scale:                          1.0000
## Method:                          IRLS   Log-Likelihood:            -2.1874e+05
## Date:                Wed, 02 Feb 2022   Deviance:                   4.3748e+05
## Time:                        11:59:06   Pearson chi2:                 3.27e+05
## No. Iterations:                     4   Pseudo R-squ. (CS):           0.001265
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     -0.5503      0.006    -87.977      0.000      -0.563      -0.538
## distance    9.865e-05   4.84e-06     20.388      0.000    8.92e-05       0.000
## ==============================================================================

Great! Now how do we interpret the estimates? The easiest way is to convert them to the probability scale (instead of talking about log-odds or odds ratios). We do this by “undoing” the link function: the inverse of the logit link function is the logistic function, which we can call in R with plogis() or implement in Python by defining the inverse logit function manually (scipy.special.expit also provides it).

So, a one unit increase in distance is associated with a:

plogis(9.865e-05)
## [1] 0.5000247

increase in the relative probability of being delayed. Or with Python:

import numpy as np

def inv_logit(p):
    return np.exp(p) / (1 + np.exp(p))

inv_logit(9.865e-05)
## 0.50002466249998

Considering that a 50% probability is basically a coin flip, distance doesn’t really have any predictive power.
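One way to make that concrete is to compare predicted probabilities at two different distances. This is a sketch; the values 500 and 2,500 miles are arbitrary picks for illustration:

# predicted probability of a delay for a short versus a long flight
predict(binom_fit_r,
        newdata = tibble(distance = c(500, 2500)),
        type = "response")

Given the estimates above, these come out to roughly 0.38 and 0.42, so a 2,000 mile increase in distance shifts the predicted probability of a delay by only a few percentage points.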

Exercise

  1. What is the likelihood function and the link function for this model:

    \[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{probit}(p_i) &= \alpha + \beta x_i \end{aligned} \]

  2. Use Python or R to fit the model above and print the results.

Poisson Regression

What if our research question asked whether the number of delays depends on the month of the year? We would need a different model for this. Let’s prepare the data first, using our tidyverse skills to calculate the number of delays in each month.

flight_data_agg <- flight_data %>% 
  group_by(year, month) %>% 
  summarize(num_delays = sum(delay_dum))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
head(flight_data_agg)
## # A tibble: 6 × 3
## # Groups:   year [1]
##    year month num_delays
##   <dbl> <dbl>      <dbl>
## 1  2013     1       9620
## 2  2013     2       9088
## 3  2013     3      11166
## 4  2013     4      10484
## 5  2013     5      11227
## 6  2013     6      12558

Now we have a data generating process that produces discrete counts ranging from 0 to \(\infty\).

  1. What distribution (likelihood function) describes a process like this?
  2. What link function should we use?

For this situation, the Poisson distribution is a good choice for a likelihood since it describes a discrete process that generates a count of events taking place. The Poisson distribution is defined for counts from 0 all the way to \(\infty\); in other words, it works for all non-negative integers, including 0. If our data contained far more zeros than a Poisson process would predict, we would want to use a zero-inflated Poisson likelihood function instead.
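As a quick check that the Poisson does put probability on zero counts, here is the probability of observing exactly zero events when the rate is 2:

dpois(0, lambda = 2)
## [1] 0.1353353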

Let’s see what the Poisson distribution looks like:

poisson_data <- tibble(
  # range of events
  s = 0:20,
  
  # lambda (mean number of event occurrences)
  rate = 2,
  
  # probability mass
  prob = dpois(x = s, lambda = rate)
)

ggplot(data = poisson_data, aes(x = s, y = prob)) +
  geom_col() +
  labs(x = "# of Events", y = "Probability",
       title = "The Poisson Distribution")

But how do we map the linear predictor to the range of all possible values? We’ll use the log link function: the log maps positive rates onto the whole real line, and its inverse, the exponential function, maps any value of the linear predictor back to a positive rate.

Let’s visualize this link function:

ggplot(data = tibble(x = c(0, 8)), aes(x = x)) +
  stat_function(fun = function(x) log(x), 
                geom = "line") +
  coord_flip() +
  labs(y = "x", x = "exp(x)", title = "Plotting the Log Link Function")

It’s always positive!

Mathematically, the Poisson model looks like this:

\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \alpha + \beta x_i \end{aligned} \]

Here is the fitted model in R:

poisson_fit_r <- glm(formula = num_delays ~ month,
                     family = poisson(link = "log"),
                     data = flight_data_agg)
summary(poisson_fit_r)
## 
## Call:
## glm(formula = num_delays ~ month, family = poisson(link = "log"), 
##     data = flight_data_agg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -29.995  -15.946    2.371   11.586   28.875  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) 9.2592482  0.0059830 1547.585  < 2e-16 ***
## month       0.0020939  0.0008105    2.583  0.00978 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4208.7  on 11  degrees of freedom
## Residual deviance: 4202.0  on 10  degrees of freedom
## AIC: 4339.1
## 
## Number of Fisher Scoring iterations: 4

And in Python:

poisson_py = smf.glm(formula = "num_delays ~ month",
                     family = sm.families.Poisson(sm.families.links.log()),
                     data = r.flight_data_agg)

poisson_fit_py = poisson_py.fit()

print(poisson_fit_py.summary())
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:             num_delays   No. Observations:                   12
## Model:                            GLM   Df Residuals:                       10
## Model Family:                 Poisson   Df Model:                            1
## Link Function:                    log   Scale:                          1.0000
## Method:                          IRLS   Log-Likelihood:                -2167.6
## Date:                Wed, 02 Feb 2022   Deviance:                       4202.0
## Time:                        11:59:07   Pearson chi2:                 4.18e+03
## No. Iterations:                     4   Pseudo R-squ. (CS):             0.4266
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      9.2592      0.006   1547.585      0.000       9.248       9.271
## month          0.0021      0.001      2.583      0.010       0.001       0.004
## ==============================================================================

This isn’t the best model, because it doesn’t really make sense to treat month as a continuous variable, but we’re not going to worry about that here (see the sketch below for one alternative).
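If you did want to address it, one sketch of an alternative is to treat month as a categorical predictor with factor(). Note that with only one aggregated row per month this particular model would be saturated (12 parameters for 12 observations), so in practice you would fit it on less aggregated data:

# treat month as a set of categories instead of a continuous trend
poisson_factor_r <- glm(num_delays ~ factor(month),
                        family = poisson(link = "log"),
                        data = flight_data_agg)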

Now, how do I interpret THIS model? Just like before, we need to “undo” the link function. In this case, the exponential function \(\exp()\) is the inverse of the log link function.

So, a one-unit increase in month (which doesn’t really make sense here) is associated with a

(exp(0.0020939) - 1) * 100
## [1] 0.2096094

percent increase in the number of delayed flights.

In Python it’s:

import numpy as np
(np.exp(0.0020939) - 1) * 100
## 0.20960937394949308

Modeling with Other Distributions

We talked about the Poisson and binomial distributions, but there are plenty of other likelihood functions you can use. For example, need to model a proportion? Something like, say, the proportion of a city’s residents who get stopped by the police? For that you would want to use the beta distribution.

The beta distribution is for continuous data that can only fall between 0 and 1. What link function would we use? Well, you already know a link function that keeps the linear predictor between 0 and 1: the logit link function. You can fit a beta regression in R with the betareg package, as sketched below.
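Here is a hypothetical sketch of what a beta regression call looks like. The data frame city_stops and its columns prop_stopped and median_income are made-up names for illustration:

# install.packages("betareg")
library(betareg)

# beta regression with a logit link (the default);
# prop_stopped must be strictly between 0 and 1
beta_fit <- betareg(prop_stopped ~ median_income, data = city_stops)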

What if your outcome consists of different unordered categories? For that you’d need the categorical distribution, which you can fit in R with the nnet package (see the sketch below).
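Here is a hypothetical sketch using nnet::multinom(); the outcome delay_cause is a made-up unordered factor for illustration:

library(nnet)

# multinomial (categorical) regression: one set of coefficients
# per non-baseline category of the outcome
cat_fit <- multinom(delay_cause ~ distance, data = flight_data)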

Reporting Results

What’s the best way to report the results of your research project? R Markdown! You can write your entire paper in R Markdown and include all your code for plots and models so that your document updates dynamically. Check out this blog post on using RMarkdown for writing scientific papers: https://libscie.github.io/rmarkdown-workshop/handout.html. What should you report? It’s generally helpful to report three things:

  1. A table of your model’s results
  2. A plot of your model’s results
  3. A plot of your model’s fit to the data

Building a Table of Regression Results

There are many ways to create a table of your model’s results, but perhaps the easiest is with a package called stargazer, which is available for R and Python. Let’s use stargazer to build a table for the binomial model results. By default, stargazer produces a table in LaTeX, but we can change that with the type = argument.

#install.packages("stargazer")
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(binom_fit_r, type = "text")
## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                            delay_dum         
## ---------------------------------------------
## distance                   0.0001***         
##                            (0.00000)         
##                                              
## Constant                   -0.550***         
##                             (0.006)          
##                                              
## ---------------------------------------------
## Observations                327,346          
## Log Likelihood           -218,740.600        
## Akaike Inf. Crit.         437,485.200        
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

You can fully customize this table within the stargazer function so that you don’t need to do any manual edits. Avoiding manual editing is good because it reduces errors and ensures your tables are produced programmatically, which helps with reproducibility.

#install.packages("stargazer")
library(stargazer)

stargazer(binom_fit_r, type = "html",
          title = "Regression Results of Departure Delays and Travel Distance",
          covariate.labels = c("Distance"),
          dep.var.labels = c("Departure Delay"),
          notes = c("Model fit using a binomial likelihood function with a logit link."))
Regression Results of Departure Delays and Travel Distance
===========================================================
                                   Dependent variable:
                               ----------------------------
                                     Departure Delay
-----------------------------------------------------------
Distance                                0.0001***
                                        (0.00000)

Constant                                -0.550***
                                         (0.006)

-----------------------------------------------------------
Observations                             327,346
Log Likelihood                        -218,740.600
Akaike Inf. Crit.                      437,485.200
===========================================================
Note:                          *p<0.1; **p<0.05; ***p<0.01
Model fit using a binomial likelihood function with a logit link.

Let’s build a table for the Poisson model using stargazer in Python (you’ll need to install it first by typing pip install stargazer in the terminal):

pip install stargazer
## Requirement already satisfied: stargazer in /Users/nickjenkins/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.8/site-packages (0.0.5)

Now we can build the table:

from stargazer.stargazer import Stargazer

results_table = Stargazer([poisson_fit_py])
results_table.title("Regression Results of the Number of Departure Delays and Months")
results_table.rename_covariates({"month": "Month"})
results_table.add_custom_notes(["Model fit using a Poisson likelihood function with a log link."])
results_table.dependent_variable_name("Number of Delays")
results_table.render_html()
Regression Results of the Number of Departure Delays and Months
================================================================
                                       Number of Delays
                                             (1)
----------------------------------------------------------------
Intercept                                 9.259***
                                           (0.006)
Month                                     0.002***
                                           (0.001)
----------------------------------------------------------------
Observations                                 12
R2
Adjusted R2
Residual Std. Error                    1.000 (df=10)
F Statistic                              (df=1; 10)
================================================================
Note:                           *p<0.1; **p<0.05; ***p<0.01
Model fit using a Poisson likelihood function with a log link.

We can’t render the table as plain text with the Python version, unfortunately.

Plotting Your Model’s Results

The next thing to create is a plot of your model’s coefficients, known as a coefficient plot. In R, we can do this pretty easily with the broom package and ggplot2.

library(broom)

tidy(binom_fit_r, conf.int = TRUE) %>% 
  ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange()

Again, a plot like this should be fully customized to “speak for itself.”

tidy(binom_fit_r, conf.int = TRUE) %>% 
  ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange() +
  labs(x = "Point Estimate (log odds)",
       y = "Coefficient",
       title = "Regression Results of Departure Delays and Distance",
       caption = "Model fit with binomial likelihood and logit link function.") +
  scale_y_discrete(labels = c("Intercept", "Distance"))

We can make a similar plot for the Poisson model as well, as sketched below. (I’m sure this is possible in Python too, but I couldn’t find a quick way to do it.)
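Using the same broom and ggplot2 recipe:

tidy(poisson_fit_r, conf.int = TRUE) %>% 
  ggplot(aes(x = estimate, y = term, xmin = conf.low, xmax = conf.high)) +
  geom_pointrange() +
  labs(x = "Point Estimate (log scale)",
       y = "Coefficient",
       title = "Regression Results of Delay Counts and Month",
       caption = "Model fit with Poisson likelihood and log link function.")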

Plotting Your Model’s Fit to the Data

The last plot we’ll cover shows your fitted model against your data. These plots are useful because they give you a sense of how well your model fits and can help you identify problems. Let’s plot the binomial model fit in R:

# create a grid of distance values to predict over
pred_data <- tibble(
  distance = seq(from = min(flight_data$distance),
                 to = max(flight_data$distance),
                 length.out = 100)
)

binom_preds <- 
  # calculate predicted values
  predict(binom_fit_r,
          newdata = pred_data,
          type = "link",
          se.fit = TRUE) %>% 
  # convert to a data frame
  data.frame() %>% 
  # calculate upper and lower confidence bands
  mutate(ll = plogis(fit - 1.96 * se.fit),
         ul = plogis(fit + 1.96 * se.fit),
         fit = plogis(fit)) %>% 
  # add predictions to the prediction data
  bind_cols(pred_data)

ggplot(data = binom_preds, aes(x = distance, y = fit)) +
  # add fitted line
  geom_line() +
  # add error bars
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.5) +
  # add data
  geom_point(data = flight_data, aes(y = delay_dum),
             alpha = 0.1, color = "skyblue") +
  labs(x = "Distance", y = "Predicted Probability")

Resources for Further Study

To learn more about generalized linear models, I’d recommend these resources:


  1. Check this article on why the normal distribution is so common: https://towardsdatascience.com/why-is-the-normal-distribution-so-normal-e644b0a50587
