--- title: "C10 Logistic Regression" author: "David Housman" format: docx editor: visual fig-width: 6 fig-height: 4 execute: enabled: false --- ## Setup Load the packages for data importing (`readr`), wrangling (`dplyr` and `tidyr`), and visualizing (`ggplot2`). ```{r} #| message: false #| warning: false library(tidyverse) ``` ## Bogus Example Suppose $y$ is a binary response variable and $x$ is a quantitative explanatory variable. Logistic regression obtains a model for the probability $p(x)$ that $y$ is 1 given the value $x$ of the explanatory variable. ```{r} #Create binary vs. quantitative data bogus = tibble( x = c(0, 2, 3, 4, 5), y = c(0, 1, 0, 1, 1) ) #Obtain logistic model model = glm(y ~ x, data = bogus, family = binomial()) #Define the model function best = function(x) predict(model, newdata = tibble(x = x), type = "response") #Display scatter plot and model ggplot(bogus, aes(x = x, y = y)) + geom_point() + geom_function(fun = best, linewidth = 1) + scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.25)) + xlab("x") + ylab("Pr(y | x)") + ggtitle("Bogus") ``` The model predicts that a case with $x = 1$ has about a 24% chance of $y = 1$. For quantitative-binary data $$(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n),$$ rather than a linear model $p = a + b x$, we use a logistic model $$p(x) = \frac{1}{1 + e^{-(a + b x)}},$$ and rather than minimizing $\sum_{i=1}^n (y_i - p(x_i))^2$, we minimize the likelihood function $$L(a,b) = \prod_{i=1}^n p(x_i)^{y_i} (1 - p(x_i)^{1 - y_i},$$ or, equivalently, minimize the log-likelihood function $$\ell(a,b) = \sum_{i=1}^n y_i\log(p(x_i)) (1 - (1 - y_i)\log(p(x_i)).$$ The likelihood $L(a,b)$ is the probability of obtaining the sample we obtained if the population was given by the logistic model with parameters $a$ and $b$. Thus, the parameters $a$ and $b$ chosen correspond to the population that has the largest probability of yielding the actual sample. Unlike linear regression parameters for which there are explicit formulas, there are no explicit formulas for the logistic regression parameters. Instead numerical optimization algorithms are used; hence, our use of `glm` rather than `lm`. Blackbox functions `predict` and `coef` still work in this different context. ```{r} coef(model) ``` The coefficients tell us that $$p(x) = \dfrac{1}{1+e^{−(-2.2363859 + 0.9926663x)}}$$ or in terms of log-odds $$\log\left(\dfrac{p(x)}{1-p(x)}\right) = -2.2363859 + 0.9926663x$$ Analogous to linear regression, there are an intercept and a slope to interpret. The intercept is not always meaningful, and so a different model point may be chosen. Based on the following computation, we could say that the model predicts a case with $x = 0$ has probability 0.097 of the response being 1, or a case with $x = 2.25$ has probability 0.5 of the response being 1. Observe that both of these numbers could have been estimated from the plot. ```{r} tibble( xvalues = c(0, 2.2363859/0.9926663), p1 = predict(model, newdata = tibble(x = xvalues), type = "response"), p2 = 1 / (1 + exp(-(-2.2363859 + 0.9926663*xvalues))) ) ``` The following calculation tells us that a one unit increase in the explanatory variable multiplies the odds of the response being 1 by 2.7. ```{r} exp(0.9926663) ``` Logistic regression measures of model fit are analogous in spirit (but not in detail) to some measures of linear regression model fit. ```{r} tibble( NullDeviance = model$null.deviance, Deviance = model$deviance, R2_McF = 1 - Deviance / NullDeviance, AIC = model$aic) ``` Analogous to the TSS, the null deviance (6.73 for our example) is a measure of the variation in the data if the intercept-only model ($p = 3/5$) is used. Analogous to the RSS, the residual deviance (4.57 in our example) is a measure of the variation in the difference between the response data and model predictions. Analogous to $R^2$, McFadden's pseudo-$R^2$ is calculated in the same manner ($1-4.57/6.73 = 0.321$). Rather than a proportion of variation explained by $x$, McFadden's pseudo-$R^2$ is a proportion of log-likelihood information loss: the logistic model reduces the log-likelihood loss by 32.1% relative to predicting a constant probability (3/5 chance that a case has a response of 1) for every case. Logistic regression models with a McFadden's pseudo-$R^2$ value between 0.2 and 0.4 are considered to be quite strong. Analogous to the adjusted $R^2$, the Akaike Information Criterion (8.57 in our example) can be used to compare models with a single number that takes into account both goodness of fit and model complexity, with smaller values preferred. Only differences in AIC between pairs of models matters. If the difference is less than 2, then the predictive power of the two models are the same whereas if the difference is at least 10, then the predictive power of the lower-AIC model is overwhelmingly better. Based upon the following calculation, the addition of a squared term is counterproductive. ```{r} model2 = glm(y ~ x + I(x^2), data = bogus, family = binomial()) tibble( NullDeviance = model2$null.deviance, Deviance = model2$deviance, R2_McF = 1 - Deviance / NullDeviance, AIC = model2$aic) ``` ## R Tidyverse Example Our goal will be to use the Math 323 introduction data to predict gender as a function of height, weight, and/or excitement. Read the Math 323 Introduction data, remove non-binary students and those with clearly incorrectly coded heights, keep only the variables of interest, remove rows with missing data, and recode Gender as 0 for male and 1 for female. ```{r} #| message: false intro = read_csv("03 Introduction.csv") |> filter(Gender != "N") |> filter(Height > 10) |> select(Gender, Height, Weight, Excitement) |> drop_na() |> mutate(Gender = as.integer(Gender == "F")) ``` Obtain and visualize the logistic regression model. ```{r} #Obtain logistic model model = glm(Gender ~ Height, data = intro, family = binomial()) #Define the model functon best = function(x) predict(model, newdata = tibble(Height = x), type = "response") #Display scatter plot and model h = 0.05 ggplot(intro, aes(x = Height, y = Gender)) + geom_jitter(height = h, width = 0, alpha = 0.4) + geom_function(fun = best, linewidth = 1) + scale_y_continuous(limits = c(0-h, 1+h), breaks = seq(0, 1, 0.25)) + xlab("Height (in)") + ylab("Pr(Female | Height)") + ggtitle("Math 323 Students") ``` Display quantitative information about the logistic model. ```{r} coef(model) ``` We could use these numbers to write the best-fit logistic model for the probability of being a female as a function of height; however, I get more insight from the graph than the formula. ```{r} tibble( x = c(0, 31.3271/0.4664), p1 = predict(model, newdata = tibble(Height = x), type = "response"), p2 = 1 / (1 + exp(-(31.3271 - 0.4664*x))) ) ``` This computation allows us to say that the model predicts a person 0 inches tall has probability 1 of being a female (not overly insightful), and a person 67.2 inches tall has probability 0.5 of being a female (more insightful).. ```{r} exp(-0.4664) ``` This tells us that a one inch increase in the height multiplies the odds of the person being a female by 0.627. ```{r} tibble( NullDeviance = model$null.deviance, Deviance = model$deviance, R2_McF = 1 - Deviance / NullDeviance, NullProb = sum(intro$Gender) / nrow(intro), AIC = model$aic) ``` McFadden's pseudo-$R^2$ tells us that the logistic model using height reduces the log-likelihood loss by 32.4% relative to predicting a constant probability (0.356) for everyone. In more qualitative terms, height provides a substantial improvement over baseline prediction of gender.The AIC is only useful when comparing different models. Here is code that computes the AIC for several models. ```{r} AIC = function(f) { model = glm(f, data = intro, family = binomial()) tibble( Formula = deparse(f), AIC = model$aic) } bind_rows( AIC(Gender ~ Height), AIC(Gender ~ Weight), AIC(Gender ~ Excitement), AIC(Gender ~ Height + Weight), AIC(Gender ~ Height * Weight) ) ``` Excitement is an extremely poor predictor for gender, height is substantially better than weight, combining height and weight is moderately better than height alone, and adding an interaction term to height and weight is counter productive.