---
title: "C23-24 Normal Random Variables"
author: "David Housman"
format: docx
editor: visual
fig-width: 6
fig-height: 2.4
---

```{r}
#| message: false
#| warning: false
#| echo: false
library(tidyverse)
library(gridExtra)
```

## Concepts Part 1

The *standard normal* random variable $Z$ has a distribution that is unimodal, is symmetric about its mean of 0, and has a standard deviation of 1. More generally, $X = \mu + \sigma Z$ is a *normal* random variable with mean $\mu$ and standard deviation $\sigma$. The following R code illustrates the use of the four `norm` functions: density (`d`), cumulative probability (`p`), quantiles (`q`), and random sample (`r`).

```{r}
norm_dist = tibble(
  z = seq(-3.5, 3.5, 0.05),
  pdf = dnorm(z),
  cdf = pnorm(z)
)
p1 = ggplot(norm_dist, aes(x = z, y = pdf)) + geom_line()
p2 = ggplot(norm_dist, aes(x = z, y = cdf)) + geom_line()
grid.arrange(p1, p2, nrow = 1)
qnorm((1:3)/4)
rnorm(6)
```

The pdf for the standard normal distribution is $h(z) = \dfrac{1}{\sqrt{2\pi}} e^{-z^2/2}$. The following code verifies that $h$ is a pdf with mean 0 and standard deviation 1, and it shows where the empirical rule comes from.

```{r}
#| eval: false
h = function(z) 1/sqrt(2*pi) * exp(-z^2/2)
integrate(h, -Inf, Inf)
integrate(function(z) z*h(z), -Inf, Inf)
integrate(function(z) z^2*h(z), -Inf, Inf)
integrate(h, -1, 1)
integrate(h, -2, 2)
integrate(h, -3, 3)
```

## Exercises Part 1

1.  Derive the standard normal pdf under the reasonable assumption that it is of the form $h(z) = ae^{-bz^2}$. The unimodality, symmetry, and zero mean follow immediately. To be a pdf, $1 = \displaystyle\int_{-\infty}^{\infty} ae^{-bz^2} \, dz$. We use some cleverness to calculate the definite integral.
    $\qquad\begin{array}{lcl} 1 &=& \left(\displaystyle\int_{-\infty}^{\infty} ae^{-bx^2} \, dx\right) \left(\displaystyle\int_{-\infty}^{\infty} ae^{-by^2} \, dy\right) \\ &=& \displaystyle\int_{-\infty}^{\infty}\displaystyle\int_{-\infty}^{\infty} ae^{-bx^2}ae^{-by^2} \, dx \, dy \\ &=& a^2\displaystyle\int_{-\infty}^{\infty}\displaystyle\int_{-\infty}^{\infty} e^{-b(x^2+y^2)} \, dx \, dy \\ &=& a^2\displaystyle\int_{0}^{2\pi}\displaystyle\int_{0}^{\infty} e^{-br^2} r \, dr \, d\theta \\ &=& a^2 \cdot 2\pi \cdot \left. -\dfrac{1}{2b}e^{-br^2} \right|_{0}^{\infty} \\ &=& \dfrac{a^2\pi}{b} \end{array}$

    (The fourth line converts the double integral to polar coordinates, where $dx \, dy = r \, dr \, d\theta$.) Hence, $a = \sqrt{b/\pi}$. For the standard deviation to be 1, the following must hold:

    $\qquad\begin{array}{lcl} 1 &=& \displaystyle\int_{-\infty}^{\infty} z^2h(z) \, dz \\ &=& \displaystyle\int_{-\infty}^{\infty} az \cdot ze^{-bz^2} \, dz \\ &=& \left. az \cdot \frac{-1}{2b}e^{-bz^2} \right|_{-\infty}^{\infty} - \displaystyle\int_{-\infty}^{\infty} a \cdot \frac{-1}{2b}e^{-bz^2} \, dz \\ &=& 0 + \frac{1}{2b}\displaystyle\int_{-\infty}^{\infty} ae^{-bz^2} \, dz \\ &=& \frac{1}{2b} \end{array}$

    (The third line integrates by parts with $u = az$ and $dv = ze^{-bz^2} \, dz$; the boundary term vanishes.) Hence, $b = 1/2$ and $a = 1/\sqrt{2\pi}$. Thus, $h(z) = \dfrac{1}{\sqrt{2\pi}} e^{-z^2/2}$.

2.  Determine the inflection points of $h$.

3.  Suppose $X = a + bZ$ for some real numbers $a$ and $b$. Obtain the mean, standard deviation, cdf, and pdf for $X$.

4.  Each part of the SAT is calibrated to be normally distributed with a mean of 500 and a standard deviation of 100.

    a.  What percentage of test takers obtain a score of at least 750 on the Math SAT exam?

    ```{r}
    #Directly, using the mean and sd arguments of pnorm
    1 - pnorm(750, 500, 100)
    ```

    ```{r}
    #Equivalently, after standardizing the score
    1 - pnorm((750-500)/100)
    ```

    ```{r}
    #Equivalently, by integrating the standard normal pdf
    h = function(z) 1/sqrt(2*pi)*exp(-0.5*z^2)
    integrate(h, (750-500)/100, Inf)
    ```

    b.  If you want to be in the top 30% of the test takers, what must you score on the Math SAT?

    ```{r}
    qnorm(1-0.3, 500, 100)
    ```

5.  The weight of cereal in a box is a random variable with mean 12.15 ounces and standard deviation 0.2 ounce.

    a.
What percentage of the boxes have contents that weigh under 12 ounces?

    ```{r}
    pnorm(12, 12.15, 0.2)
    ```

    b.  If the manufacturer claims that there are 12 ounces in a box, does the percentage you just found cause concern?

    c.  Whether or not it is a concern, what two things could be done to correct the situation? What would be the short run and long run costs associated with each change?

    d.  How much of the first change must be made so that only 2% of boxes are less than 12 ounces?

    ```{r}
    #Raise the mean: solve P(X < 12) = 0.02 for the mean with sd fixed at 0.2
    12 - 0.2*qnorm(0.02)
    ```

    e.  How much of the second change must be made so that only 2% of boxes are less than 12 ounces?

    ```{r}
    #Reduce the sd: solve P(X < 12) = 0.02 for the sd with mean fixed at 12.15
    (12-12.15)/qnorm(0.02)
    ```

## Concepts Part 2

The normal distribution arises whenever a random variable is the sum of many independent random variables, no one of which differs too much from the others. For example, human height is controlled by several genes and a variety of environmental factors. Symbolically, if $X_1, X_2, \dots, X_m$ are independent random variables whose distributions are similar and $m$ is sufficiently large, then $X = X_1 + X_2 + \cdots + X_m$ is approximately normal (a version of the central limit theorem). As we have seen previously, $\mu_{X} = \mu_{X_1} + \mu_{X_2} + \cdots + \mu_{X_m}$ and $\sigma_X^2 = \sigma_{X_1}^2 + \sigma_{X_2}^2 + \cdots + \sigma_{X_m}^2$.

## Examples Part 2

Create $m$ independent, similarly distributed random variables $X_1, X_2, \dots, X_m$, plot their pdfs, simulate the sum of the random variables, and compare the simulated and theoretical distributions. The first example creates random variables with identical distributions that are not at all bell shaped. The second example creates randomly shifted binomials with randomly chosen parameters.
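Before the two longer examples, here is a minimal self-contained sketch of the same phenomenon (the Uniform(0,1) choice, the number of runs, and the seed are illustrative assumptions, not taken from the examples): each $X_i \sim \text{Uniform}(0,1)$ has mean $1/2$ and variance $1/12$, so the sum of $m = 12$ of them has mean 6 and standard deviation 1, and its simulated distribution should match Normal(6, 1) closely.

```{r}
#A minimal central limit check: the sum of 12 independent Uniform(0,1)
#variables has mean 12*(1/2) = 6 and sd sqrt(12*(1/12)) = 1
set.seed(1)  #arbitrary seed, for reproducibility only
runs = 10000
sums = replicate(runs, sum(runif(12)))
c(mean(sums), sd(sums))
#Proportion of sums within one sd of the mean, versus the normal value
mean(abs(sums - 6) <= 1)
pnorm(1) - pnorm(-1)
```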
```{r}
#| eval: false
#Example 1: X[i] ~ Geometric(p)
m = 25
p = 1/6
#Obtain the distribution for all of the X[i]
mu = 1/p
sigma = sqrt(1-p)/p
dist = tibble(
  x = 1:ceiling(mu+4*sigma),
  pdf = (1-p)^(x-1) * p
)
#Plot the pdf as a spike graph
ggplot(dist, aes(x = x, y = pdf)) + geom_col(width = 0.2)
#Simulate the sum of the random variables
#Note that R defines geometric as the number of tails before the
#first head, rather than the number of the flip on which the first head occurs
runs = 100
sums = numeric(runs)
for (t in 1:runs) {
  xdata = 1 + rgeom(m, p)
  sums[t] = sum(xdata)
}
simulation = tibble(sum = sums)
#Obtain the mean and sd of X[1] + X[2] + ... + X[m]
musum = m*mu
sigmasum = sqrt(m)*sigma
```

```{r}
#| eval: false
#Compare simulation and theory means and standard deviations
simulation |> summarise(mean(sum), sd(sum))
tibble(musum, sigmasum)
#Compute the claimed normal distribution
theory = tibble(
  x = seq(musum-3*sigmasum, musum+3*sigmasum, length.out=100),
  pdf = dnorm(x, musum, sigmasum),
  cdf = pnorm(x, musum, sigmasum)
)
#Compare simulation and theoretical pdfs
ggplot(simulation, aes(x = sum, y = after_stat(density))) +
  geom_histogram(bins = 30, color = "white") +
  geom_density(linewidth = 2) +
  geom_line(aes(x = x, y = pdf), theory, color = "red", linewidth = 2)
#Compare simulation and theoretical cdfs
ggplot(simulation, aes(x = sum)) +
  stat_ecdf() +
  geom_line(aes(x = x, y = cdf), theory, color = "red")
#Compare simulation and theory with a quantile-quantile plot
ggplot(simulation, aes(sample = sum)) +
  stat_qq(distribution = qnorm, dparams = list(mean = musum, sd = sigmasum), alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red")
```

```{r}
#| eval: false
#Example 2: X[i] ~ a[i] + Binomial(n[i],p[i])
#Randomly choose the shift (a), flips (n), and probabilities of heads (p)
m = 25
a = sample(-10:10, m, replace=TRUE)
n = sample(20:50, m, replace=TRUE)
p = runif(m)
#Obtain the pdfs of the shifted binomials
dists = tibble()
for(i in 1:m) {
  dist = tibble(
    rv = i,
    x = a[i] + 0:n[i],
    pdf = dbinom(0:n[i], n[i], p[i])
  )
  dists = rbind(dists, dist)
}
#Plot the pdfs as line graphs for greater visibility
ggplot(dists, aes(x = x, y = pdf, color = as.factor(rv))) +
  geom_line() +
  theme(legend.position = "none")
#Simulate the sum of the random variables
runs = 1000
sums = numeric(runs)
for (t in 1:runs) {
  xdata = numeric(m)
  for (i in 1:m) xdata[i] = a[i] + rbinom(1, n[i], p[i])
  sums[t] = sum(xdata)
}
simulation = tibble(sum = sums)
#Obtain the mean and sd of X[1] + X[2] + ... + X[m]
mu = a + n*p
sigma = sqrt(n*p*(1-p))
musum = sum(mu)
sigmasum = sqrt(sum(sigma^2))
```

```{r}
#| eval: false
#Compare simulation and theory means and standard deviations
simulation |> summarise(mean(sum), sd(sum))
tibble(musum, sigmasum)
#Compute the claimed normal distribution
theory = tibble(
  x = seq(musum-3*sigmasum, musum+3*sigmasum, length.out=100),
  pdf = dnorm(x, musum, sigmasum),
  cdf = pnorm(x, musum, sigmasum)
)
#Compare simulation and theoretical pdfs
ggplot(simulation, aes(x = sum, y = after_stat(density))) +
  geom_histogram(bins = 30, color = "white") +
  geom_density(linewidth = 2) +
  geom_line(aes(x = x, y = pdf), theory, color = "red", linewidth = 2)
#Compare simulation and theoretical cdfs
ggplot(simulation, aes(x = sum)) +
  stat_ecdf() +
  geom_line(aes(x = x, y = cdf), theory, color = "red")
#Compare simulation and theory with a quantile-quantile plot
ggplot(simulation, aes(sample = sum)) +
  stat_qq(distribution = qnorm, dparams = list(mean = musum, sd = sigmasum), alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red")
```
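Returning to exercise 3, the relationship $X = \mu + \sigma Z$ that underlies the standardizations used throughout can be checked numerically: $F_X(x) = F_Z((x-\mu)/\sigma)$ and $Q_X(p) = \mu + \sigma Q_Z(p)$. A quick sketch with the SAT parameters from exercise 4 (the grids of scores and probabilities are arbitrary choices):

```{r}
#Check that pnorm and qnorm with mean and sd arguments agree with
#standardizing by hand, using mu = 500 and sigma = 100
mu = 500
sigma = 100
x = seq(200, 800, 50)
all.equal(pnorm(x, mu, sigma), pnorm((x - mu)/sigma))
p = (1:9)/10
all.equal(qnorm(p, mu, sigma), mu + sigma*qnorm(p))
```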